ABSTRACT



Quantum theory forbids physical measurements from giving observers enough evidence to dis- 
tinguish nonorthogonal quantum states — this is an expression of the indeterminism inherent in 
all quantum phenomena. As a consequence, operational measures of how distinct one quantum 
state is from another must depend exclusively on the statistics of measurements and nothing else. 
This compels the use of various information-theoretic measures of distinguishability for probability 
distributions as the starting point for quantifying the distinguishability of quantum states. The 
process of translating these classical distinguishability measures into quantum mechanical ones is 
the focus of this dissertation. The measures so derived have important applications in the young 
fields of quantum cryptography and quantum computation and, also, in the more established field 
of quantum communication theory. 

Three measures of distinguishability are studied in detail. The first — the statistical overlap 
or fidelity — upper bounds the decrease (with the number of measurements on identical copies) of 
the probability of error in guessing a quantum state's identity. The second, the Kullback-Leibler
relative information, quantifies the distinction between the frequencies of measurement outcomes
when the true quantum state is one or the other of two fixed possibilities. The third — the mutual 
information — is the amount of information that can be recovered about a state's identity from a 
measurement; this quantity dictates the amount of redundancy required to reconstruct reliably a 
message whose bits are encoded by quantum systems prepared in the specified states. For each of 
these measures, an optimal quantum measurement is one for which the measure is as large or as 
small (whichever is appropriate) as it can possibly be. The "quantum distinguishability" for each 
of the three measures is its value when an optimal measurement is used for defining it. Generally 
all these quantum distinguishability measures correspond to different optimal measurements. 

The results reported in this dissertation include the following. An exact expression for the 
quantum fidelity is derived, and the optimal measurement that gives rise to it is studied in detail. 
The techniques required for proving this result are very useful and may be applied to other, quite 
different problems. Several upper and lower bounds on the quantum mutual information are derived 
via similar techniques and compared to each other. Of particular note is a simplified derivation 
for the important upper bound first proved by Holevo in 1973. An explicit expression is given for 
another (tighter) upper bound that appears implicitly in the same derivation. Several upper and 
lower bounds to the quantum Kullback information are also derived. Particular attention is paid 
to a technique for generating successively tighter lower bounds, contingent only upon one's ability 
to solve successively higher order nonlinear matrix equations. 

In the final Chapter, the distinguishability measures developed here are applied to a question 
at the foundation of quantum theory: to what extent must quantum systems be disturbed by 
information gathering measurements? This is tackled in two ways. The first is in setting up a
general formalism for ferreting out the tradeoff between inference and disturbance. The main 
breakthrough in this is that it gives a way of expressing the problem so that it appears as algebraic 
as that of the problem of finding quantum distinguishability measures. The second result on this 
theme is the proof of a theorem that prohibits "broadcasting" an unknown quantum state. That is 
to say, it is proved that there is no way to replicate an unknown quantum state onto two separate 
quantum systems when each system is considered without regard to the other (though there may 
well be correlation or quantum entanglement between the systems). This result is a significant 
extension and generalization of the standard "no-cloning" theorem for pure states. 
Chapter 1



Prolegomenon 



"The information's unavailable to 
the mortal man." 

— Paul Simon 
Slip Slidin' Away 

1.1 Introduction 

Suppose a single photon is secretly prepared in one of two known but nonorthogonal linear polarization
states $\vec{e}_0$ and $\vec{e}_1$. A fundamental consequence of the indeterminism inherent in all quantum
phenomena is that there is no measurement whatsoever that can discriminate with certainty which of the two
states was actually prepared. For instance, imagine an attempt to ascertain the polarization value
by sending the photon through a beam splitter; the photon will either pass straight through or
be deflected in a direction dependent upon the crystal's orientation $\vec{n}$. In this example, the only
means available for the photon to "express" the value of its polarization is through the quantum 
mechanical probability law for it to go straight 

$$p_i = |\vec{e}_i \cdot \vec{n}|^2 , \quad i = 0, 1. \tag{1.1}$$

Since the polarization vectors are nonorthogonal, there is no orientation n that can assure that 
only one preparation will pass straight through while the other is deflected. To the extent that one 
can gain information by sampling a probability distribution, one can also gain information about 
the preparation, but indeed only to that extent. If the photon goes straight through, one might 
conjecture that $p_i$ is closer to 1 than not, and thus that the actual polarization is the particular
$\vec{e}_i$ most closely aligned with the crystal orientation $\vec{n}$, but there is no clean-cut certainty here.
Ultimately one must make do with a guess. The necessity of this guess is the unimpeachable 
signature of quantum indeterminism. 
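
The point can be checked numerically. The following minimal Python sketch evaluates Eq. (1.1) over a grid of crystal orientations for two illustrative polarization directions (0 and 45 degrees, values assumed here and not taken from the text) and records the largest gap between the two pass probabilities. Perfect discrimination would require a gap of exactly 1. The helper name `pass_probability` is hypothetical.

```python
import math

def pass_probability(pol_angle, crystal_angle):
    """Eq. (1.1): |e . n|^2 for unit vectors at the given angles (radians)."""
    e = (math.cos(pol_angle), math.sin(pol_angle))
    n = (math.cos(crystal_angle), math.sin(crystal_angle))
    return (e[0] * n[0] + e[1] * n[1]) ** 2

# Assumed example: nonorthogonal polarizations at 0 and 45 degrees.
pol_angles = [0.0, math.pi / 4]

# Scan crystal orientations in 0.1 degree steps and track |p_0 - p_1|.
best_gap = 0.0
for k in range(1801):
    theta = math.pi * k / 1800
    p0, p1 = (pass_probability(a, theta) for a in pol_angles)
    best_gap = max(best_gap, abs(p0 - p1))

print(f"largest |p_0 - p_1| found: {best_gap:.4f}")
```

For these angles the gap never reaches 1, so no orientation lets one preparation pass with certainty while the other is deflected with certainty.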

Fortunately, quantum phenomena are manageable enough that we are allowed at least the handle 
of a probability assignment in predicting their behavior. The world is certainly not the higgledy- 
piggledy place it would be if we were given absolutely no predictive power over the phenomena we 
encounter. This fact is the foundation for this dissertation. It provides a starting point for building 
several notions of how distinct one quantum state is from another. 

Since there is no completely reliable way to identify a quantum state by measurement, one 
cannot simply reach into the space of quantum states, place a yardstick next to the line connecting 
two of them, and read off a distance. Similarly one cannot reach into the ethereal space of probability
assignments. Nevertheless classical information theory does give several ways to gauge the
distinction between two probability distributions. The idea of the game of determining a quantum
distinguishability measure is to start with one of the ones specified by classical information theory.
The probabilities appearing in it are assumed to be generated by a measurement on a system
described by the quantum states that one wants to distinguish. The quantum distinguishability is
defined by varying over all possible measurements to find the one that makes the classical
distinguishability the best it can possibly be. The best classical distinguishability found in this search is
dubbed the quantum measure of distinguishability.

Once reasonable measures of quantum distinguishability are in hand, there are a great number
of applications in which they can be used. To give an example — one to which we will return in the 
final chapter — step back to 1927, the year quantum mechanics became a relatively stable theory. 
The question that was the rage was why an electron could not be ascribed a classical state of motion. 
The answer — so the standard story of that year and thereafter goes — is that whenever one tries to 
discern the position of an electron, one necessarily disturbs its momentum in an uncontrollable way. 
Similarly, whenever one tries to discern the momentum, the position is necessarily disturbed. One 
can never get at both quantities simultaneously. Thus there are no means to specify operationally 
a phase space trajectory for the electron. 

Yet if one looks carefully at the standard textbook Heisenberg relation of Robertson, the
first one derived without recourse to semiclassical thought experiments, one finds nothing even
remotely resembling this picture. What is found instead is that when many copies of a system
are all prepared in the same quantum state $\psi(x)$, if one makes enough position measurements on
the copies to get an estimate of $\Delta x$, and similarly makes enough momentum measurements on
(different!) copies to get $\Delta p$, then one can be assured that the product of the numbers so found
will be no smaller than $\hbar/2$. Any attempt to extend the meaning of this relation beyond this is
dangerous and unwarranted. 

Nevertheless there is most certainly truth to the idea that information gathering measurements 
in quantum theory necessarily cause disturbances. This is one of the greatest distinctions between 
classical physics and quantum physics, and is — after all — the ultimate reason that quantum systems 
cannot be ascribed classical states of motion. It is just that this is not captured properly by the 
Heisenberg relation. What is needed, among other things, are two ingredients considered in detail 
in this dissertation. The first is a notion of the information that can be gained about the identity of
a quantum state from a given measurement model. The second is a way of comparing the quantum
state that describes a system before measurement to the quantum state that describes it after
measurement: in short, one of the quantum distinguishability measures described above.

This gives some hint of the wealth that may lie at the end of the rainbow of quantum
distinguishability and accessible information. First, however, there are many rainbows to be made
[see Fig. 3.2], and this is the focus of the work reported here. In the remainder of this Chapter,
we summarize the main results of our research and describe the salient features of probabilities, 
quantum states, and quantum measurements that will be the starting point for the later chapters. 
In Chapter 2, we describe in great detail the motivations for and derivations of several classical 
measures of distinguishability for probability distributions. Many of the things presented there
are not well known to the average physicist, but little of the work represents original research. In 
Chapter 3 — the main chapter of new results — we report everything we know about the quantum 
mechanical versions of the classical distinguishability measures introduced in Chapter 2. Chapter
4 applies the methods and measures developed in Chapter 3 to the deeper question of the tradeoff 
between information gathering and disturbance in quantum theory. The first section of Chapter 4 
is devoted to developing a general formalism for tackling questions of this ilk. The second section 
(which represents a collaboration with H. Barnum, C. M. Caves, R. Jozsa, and B. Schumacher) 
proves that there is a very general sense in which it is impossible to make a copy of an unknown 
quantum state. This is a result that extends the now-standard "no-cloning" theorem for pure
quantum states. Chapter 5 caps off the dissertation with a comprehensive bibliography of
527 books and articles relevant to quantum distinguishability and quantum state disturbance. 

1.2 Summary of Results 

Mutual Information 

As stated already, the main focus of this work is in deriving distinguishability measures for quantum 
mechanical states. First and foremost in importance among these is the one defined by the question 
of accessible information. In this problem's simplest form, one considers a binary communication 
channel, i.e., a channel in which the alternative transmissions are 0 and 1. The twist in the
quantum mechanical case is that the 0 and 1 are encoded physically as distinct states $\rho_0$ and $\rho_1$
of some quantum system (described on a $D$-dimensional Hilbert space, $D$ being finite); the states
$\rho_0$ and $\rho_1$ need not be orthogonal nor pure states for that matter. The idea here is to think of
the transmissions more literally as "letters" that are mere components in much longer words and
sentences, i.e., the meaningful messages to be sent down the channel. When the quantum states
occur with frequencies $\pi_0$ and $\pi_1$, the channel encodes, according to standard information theory,

$$H(\pi) = -(\pi_0 \log_2 \pi_0 + \pi_1 \log_2 \pi_1) \tag{1.2}$$

bits of information per transmission. This scenario is of interest precisely because of the way in 
which quantum indeterminism bars any receiver from retrieving the full information encoded in the 
individual transmissions: there is generally no measurement with outcomes uniquely corresponding 
to whether 0 or 1 was transmitted.
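
As a small illustration of Eq. (1.2), the following Python sketch computes the number of bits encoded per transmission for a given pair of prior frequencies. The function name `shannon_bits` and the example priors are assumptions made here for illustration.

```python
import math

def shannon_bits(probs):
    """Shannon information in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Equal priors encode one full bit per letter; biased priors encode less.
print(shannon_bits([0.5, 0.5]))
print(shannon_bits([0.9, 0.1]))
```

Whatever measurement the receiver uses, this prior information is the ceiling on what can be recovered per transmission.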

The amount of information that is recoverable in this scheme is quantified by an expression 
called the mutual information. This quantity can be understood as follows. When the receiver 
performs a measurement to recover the message, he initially sees the quantum system neither in 
state $\rho_0$ nor in state $\rho_1$ but rather in the mean quantum state

$$\rho = \pi_0 \rho_0 + \pi_1 \rho_1 ; \tag{1.3}$$

this is an expression of the fact that he does not know which message was sent. Therefore the raw 
information he gains upon measuring some nondegenerate Hermitian operator $M$, say,

$$M = \sum_b m_b |b\rangle\langle b| , \tag{1.4}$$

is the Shannon information

$$H(p) = -\sum_b \langle b|\rho|b\rangle \log_2 \langle b|\rho|b\rangle \tag{1.5}$$

of the distribution $p(b) = \langle b|\rho|b\rangle$ for the outcomes $b$. This raw information, however, is not solely
about the transmission; for, even if the receiver knew the system to be in one state over the other,
measuring $M$ would still give him a residual information gain quantified by the Shannon formula
for the appropriate distribution,

$$p_0(b) = \langle b|\rho_0|b\rangle \quad \text{or} \quad p_1(b) = \langle b|\rho_1|b\rangle . \tag{1.6}$$

The residual information gain is due to the fact that the measurement outcome $b$ is not
deterministically predictable even with the transmission, 0 or 1, known. Given this, the mutual information
$J$, or information concerning the transmission, is just what one would expect intuitively: the raw
information gain minus the average residual gain. That is to say,

$$J = H(p) - \pi_0 H(p_0) - \pi_1 H(p_1) . \tag{1.7}$$

The uniquely quantum mechanical question that arises in this context is which measurement maxi- 
mizes the mutual information for this channel and what is that maximal amount; this is the question 
of accessible information in quantum theory. The value of the maximal information, $I$, is called
the accessible information. 
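
To make Eq. (1.7) concrete, here is a minimal Python sketch for the simplest case: two pure qubit states with real amplitudes, measured in an orthonormal basis rotated by an angle. The state angles (0 and 45 degrees), the equal priors, and all function names are assumptions chosen for illustration. The scan covers only real orthonormal bases, not general POVMs, so its maximum should be read as a lower bound on the accessible information rather than the accessible information itself.

```python
import math

def shannon_bits(probs):
    """Shannon information in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def outcome_probs(state_angle, basis_angle):
    """Born probabilities for a real qubit state in the basis rotated by basis_angle."""
    c = math.cos(state_angle - basis_angle) ** 2
    return [c, 1.0 - c]

def mutual_information(priors, state_angles, basis_angle):
    """Eq. (1.7): J = H(p) - pi_0 H(p_0) - pi_1 H(p_1)."""
    dists = [outcome_probs(a, basis_angle) for a in state_angles]
    mean = [sum(pi * d[b] for pi, d in zip(priors, dists)) for b in range(2)]
    residual = sum(pi * shannon_bits(d) for pi, d in zip(priors, dists))
    return shannon_bits(mean) - residual

priors = [0.5, 0.5]                 # assumed prior frequencies
state_angles = [0.0, math.pi / 4]   # assumed nonorthogonal pure states

# Scan measurement bases in 0.1 degree steps and keep the largest J.
j_best = max(
    mutual_information(priors, state_angles, math.pi * k / 1800)
    for k in range(1800)
)
print(f"largest J over scanned bases: {j_best:.4f} bits")
```

The result falls well short of the one bit encoded per transmission, which is the quantitative content of the remark that indeterminism bars the receiver from retrieving the full information.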

Until recently, little has been known about the accessible information of a general quantum 
communication channel. The trouble is in not having sufficiently powerful techniques for searching 
over all possible quantum measurements, i.e., not only those measurements corresponding to Her- 
mitian operators, but also the more general positive-operator-valued measures (POVMs) with any 
number of outcomes. Aside from a few contrived examples in which the accessible information can 
be calculated explicitly, the most notable results have been in the form of bounds: bounds on the
number of outcomes in an optimal measurement and bounds on the accessible information itself.

Much of the goal of my own research on this question, reported in Chapter 3, has been to
improve upon these bounds and ultimately to find a procedure for successively approximating, to
any desired degree of accuracy, the question's solution. In this respect, some progress has been
made. The Holevo upper bound to the accessible information

$$I \le S(\rho) - \pi_0 S(\rho_0) - \pi_1 S(\rho_1) , \tag{1.8}$$

where $S(\rho) = -\mathrm{tr}(\rho \log_2 \rho)$ is the von Neumann entropy of the density operator $\rho$, and the
Jozsa-Robb-Wootters lower bound

$$I \ge Q(\rho) - \pi_0 Q(\rho_0) - \pi_1 Q(\rho_1) , \tag{1.9}$$

where $Q(\rho)$ is a complicated quantity called the sub-entropy of $\rho$, are both the best bounds expressible
solely in terms of the channel's mean density operator $\rho$ when $\rho_0$ and $\rho_1$ are pure states. To go
beyond them, one must include more details about the actual ensemble from which the signals are
drawn, i.e., details about the prior probabilities and the density operators $\rho_0$ and $\rho_1$ themselves.
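
The right-hand side of Eq. (1.8) is easy to evaluate for qubits. The sketch below is a minimal illustration restricted to real symmetric 2x2 density matrices of unit trace, whose eigenvalues follow directly from the determinant. The example ensemble (pure states at 0 and 45 degrees, equal priors) and all function names are assumptions made here.

```python
import math

def pure_state(angle):
    """Density matrix |psi><psi| for the real qubit state (cos angle, sin angle)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * c, c * s], [c * s, s * s]]

def mix(priors, states):
    """Mean density operator of Eq. (1.3)."""
    return [[sum(p * rho[i][j] for p, rho in zip(priors, states))
             for j in range(2)] for i in range(2)]

def von_neumann_bits(rho):
    """S(rho) in bits for a real symmetric, unit-trace 2x2 matrix."""
    det = rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]
    gap = math.sqrt(max(0.0, 1.0 - 4.0 * det))
    eigenvalues = [(1.0 + gap) / 2.0, (1.0 - gap) / 2.0]
    return -sum(x * math.log2(x) for x in eigenvalues if x > 0)

def holevo_bound(priors, states):
    """Right-hand side of Eq. (1.8)."""
    mean = mix(priors, states)
    return von_neumann_bits(mean) - sum(
        p * von_neumann_bits(rho) for p, rho in zip(priors, states))

priors = [0.5, 0.5]
states = [pure_state(0.0), pure_state(math.pi / 4)]
chi = holevo_bound(priors, states)
print(f"Holevo bound: {chi:.4f} bits")
```

For pure signal states the individual entropies vanish, so the bound reduces to the entropy of the mean state, in line with the remark that it depends only on $\rho$ in that case.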

Along these lines, several new tighter ensemble-dependent bounds are reported. Two of these,
an upper and a lower bound, are found in the process of simplifying the derivation of the Holevo
upper bound (an exercise useful in and of itself). The lower bound of this pair takes on a particularly
pleasing form when $\rho_0$ and $\rho_1$ have a common null subspace:

$$\pi_0\, \mathrm{tr}\big(\rho_0 \log_2 \mathcal{L}_\rho(\rho_0)\big) + \pi_1\, \mathrm{tr}\big(\rho_1 \log_2 \mathcal{L}_\rho(\rho_1)\big) \le I , \tag{1.10}$$

where the $\mathcal{L}_\rho(\rho_i)$, $i = 0, 1$, are operators that satisfy the anti-commutator equation

$$\rho\, \mathcal{L}_\rho(\rho_i) + \mathcal{L}_\rho(\rho_i)\, \rho = 2 \rho_i , \tag{1.11}$$

which has a solution in this case. It turns out that there is actually a measurement that generates
this bound and, at least for two-dimensional Hilbert spaces, it is often very close to being optimal.
When $\rho_0$ and $\rho_1$ are pure states this lower bound actually equals the accessible information.
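
One way to see that Eq. (1.11) is solvable is to work in the eigenbasis of $\rho$, where the equation decouples entry by entry. The following Python sketch assumes, purely for illustration, that $\rho$ is already diagonal and of full rank in the working basis; the example matrices and the function names are hypothetical, and this entrywise solution is a standard linear-algebra device rather than a method quoted from the text.

```python
def solve_anticommutator(rho_diag, rho_i):
    """Solve rho L + L rho = 2 rho_i when rho = diag(rho_diag), all entries > 0."""
    n = len(rho_diag)
    return [[2.0 * rho_i[j][k] / (rho_diag[j] + rho_diag[k])
             for k in range(n)] for j in range(n)]

def residual(rho_diag, L, rho_i):
    """Largest entry of |rho L + L rho - 2 rho_i| for diagonal rho."""
    n = len(rho_diag)
    return max(
        abs(rho_diag[j] * L[j][k] + L[j][k] * rho_diag[k] - 2.0 * rho_i[j][k])
        for j in range(n) for k in range(n))

# Assumed example: two mixed qubit states whose equal-weight mean is diagonal.
rho0 = [[0.9, 0.2], [0.2, 0.1]]
rho1 = [[0.6, -0.2], [-0.2, 0.4]]
rho_diag = [0.75, 0.25]   # diagonal of (rho0 + rho1) / 2

L0 = solve_anticommutator(rho_diag, rho0)
L1 = solve_anticommutator(rho_diag, rho1)
print(residual(rho_diag, L0, rho0), residual(rho_diag, L1, rho1))
```

The division by the sum of two eigenvalues shows where trouble can arise: a pair of vanishing eigenvalues of $\rho$ leaves the corresponding entry undetermined, which is why the null subspace enters the discussion.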

Another significant bound on the accessible information comes about by thinking of the states
$\rho_0$ and $\rho_1$ as arising from a partial trace over some larger Hilbert space with two possible pure
states $|\psi_0\rangle$ and $|\psi_1\rangle$ on it. For any two such purifications the corresponding accessible information
will be at least as large as that for the original mixed states; this is because by expanding a Hilbert space
one can only add distinguishability. However, the accessible information for two pure states can
be calculated exactly; it is given by the appropriate analog to Eq. (1.10). Using a result for the
maximal possible overlap between two purifications due to Uhlmann [7], one arrives at the tightest
possible upper bound of this form.

Statistical Overlap 



The second most important way of defining a notion of statistical distinguishability (see Section 3.3)
concerns the following scenario. One imagines a finite number of copies, $N$, of a quantum system
secretly prepared in the same quantum state: either the state $\rho_0$ or the state $\rho_1$. It is the task of an
observer to perform the same measurement on each of these copies and then, based on the collected 
data, to make a guess as to the identity of the quantum state. Intuitively, the more "distinguishable" 
the quantum states, the easier the observer's task, but how does one go about quantifying such a 
notion? The idea is to rely on the probability of error $P_e(N)$ in this inference problem to point
the way. For instance, one might take the probability of error itself as a measure of statistical 
distinguishability and define the quantum distinguishability to be that quantity minimized over all 
possible quantum measurements. Appealing as this may be, a difficulty crops up in that the optimal 
quantum measurement in this case explicitly depends on the number of measurement repetitions. 

To get past this blemish, one can ask instead which measurement will cause the error probability
to decrease exponentially as fast as possible in $N$; that is to say, what is the smallest $\lambda$ for which

$$P_e(N) \le \lambda^N . \tag{1.12}$$

Classically, the best bound of this form is called the Chernoff bound, and is given by

$$\lambda = \min_{0 \le \alpha \le 1} \sum_b p_0(b)^{\alpha}\, p_1(b)^{1-\alpha} , \tag{1.13}$$

where $p_0(b)$ and $p_1(b)$ are the probability distributions for the measurement outcomes. The
exponential rate of decrease in error probability, $\lambda$, optimized over all measurements must be, by
definition, independent of the number of measurement repetitions, and thus makes a natural
(operationally defined) candidate for a measure of quantum distinguishability. Unfortunately the search
for an explicit expression for this quantity remains a subject for future research. On the brighter
side, however, there is a closely related upper bound on the Chernoff bound that is of interest in
its own right: the statistical overlap.
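
For a fixed measurement the classical quantities involved are simple to compute. The Python sketch below approximates the Chernoff minimization by a grid search over the exponent and compares the result with the statistical overlap, which is the value of the same sum at exponent one half. The example distributions, the grid resolution, and the function names are assumptions; the code also assumes strictly positive probabilities so that the boundary exponents need no special handling.

```python
import math

def chernoff_lambda(p0, p1, steps=1000):
    """Grid approximation to the minimum over alpha in [0, 1] of
    sum_b p0(b)^alpha p1(b)^(1 - alpha); assumes all probabilities > 0."""
    best = float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        total = sum(a ** alpha * b ** (1.0 - alpha) for a, b in zip(p0, p1))
        best = min(best, total)
    return best

def statistical_overlap(p0, p1):
    """Sum over outcomes of sqrt(p0(b) p1(b)), the alpha = 1/2 term."""
    return sum(math.sqrt(a * b) for a, b in zip(p0, p1))

p0 = [0.9, 0.1]   # assumed outcome distribution under the first state
p1 = [0.4, 0.6]   # assumed outcome distribution under the second state

lam = chernoff_lambda(p0, p1)
overlap = statistical_overlap(p0, p1)
print(f"lambda = {lam:.4f}, overlap = {overlap:.4f}")
```

Because the overlap is one of the values the minimization ranges over, it can never fall below the minimum, which is the sense in which it bounds the Chernoff quantity from above.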

If a measurement generates probabilities $p_0(b)$ and $p_1(b)$ for its outcomes, then the statistical
overlap between these distributions is defined to be

$$F(p_0, p_1) = \sum_b \sqrt{p_0(b)}\, \sqrt{p_1(b)} . \tag{1.14}$$

This quantity, as stated, gives a more simply expressed upper bound on $\lambda$. This is quite important
because this expression is manageable enough that it actually can be optimized to give a useful
quantum distinguishability measure (even though it may not be as well-founded conceptually as $\lambda$
itself). The optimal statistical overlap is given by

$$F(\rho_0, \rho_1) = \mathrm{tr} \sqrt{\rho_1^{1/2} \rho_0\, \rho_1^{1/2}} = \mathrm{tr} \sqrt{\rho_0^{1/2} \rho_1\, \rho_0^{1/2}} , \tag{1.15}$$

a quantity known as the fidelity for quantum states, which has appeared in other mathematical
physics contexts. (Here we start the trend in notation that the same function is used to
denote both the classical distinguishability and its quantum version; notice that the former has
the probability distributions as its argument and the latter has the density operators themselves.)
This measure of distinguishability has many useful properties and is crucial to the proof of
"no-broadcasting" in Chapter 4.
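
The relation between the classical overlap and the quantum fidelity can be explored numerically for qubits. The sketch below is an illustration under stated assumptions: it uses the two-dimensional closed form $F^2 = \mathrm{tr}(\rho_0\rho_1) + 2\sqrt{\det\rho_0 \det\rho_1}$, a standard qubit identity that is not derived in this text, and it scans only real orthonormal bases, which suffices here because the example density matrices are real. The example matrices and function names are hypothetical.

```python
import math

def basis_probs(rho, theta):
    """<b|rho|b> for the real orthonormal basis rotated by theta (unit-trace rho)."""
    c, s = math.cos(theta), math.sin(theta)
    p_first = c * c * rho[0][0] + 2.0 * c * s * rho[0][1] + s * s * rho[1][1]
    p_first = min(1.0, max(0.0, p_first))   # guard against rounding error
    return [p_first, 1.0 - p_first]

def classical_overlap(p0, p1):
    """Statistical overlap of two outcome distributions."""
    return sum(math.sqrt(a * b) for a, b in zip(p0, p1))

def qubit_fidelity(rho0, rho1):
    """Closed form for 2x2 density matrices (assumed identity, see lead-in)."""
    tr_prod = sum(rho0[i][j] * rho1[j][i] for i in range(2) for j in range(2))
    det0 = rho0[0][0] * rho0[1][1] - rho0[0][1] * rho0[1][0]
    det1 = rho1[0][0] * rho1[1][1] - rho1[0][1] * rho1[1][0]
    return math.sqrt(tr_prod + 2.0 * math.sqrt(max(0.0, det0 * det1)))

rho0 = [[0.9, 0.2], [0.2, 0.1]]     # assumed mixed state
rho1 = [[0.6, -0.2], [-0.2, 0.4]]   # assumed mixed state

fidelity = qubit_fidelity(rho0, rho1)

# Smallest classical overlap over bases scanned in 0.1 degree steps.
scan_min = min(
    classical_overlap(basis_probs(rho0, math.pi * k / 1800),
                      basis_probs(rho1, math.pi * k / 1800))
    for k in range(1800))
print(f"fidelity = {fidelity:.5f}, smallest scanned overlap = {scan_min:.5f}")
```

Every scanned measurement gives an overlap at least as large as the fidelity, and the best of them comes very close to it, consistent with the fidelity being the optimized value of the classical overlap.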

Of particular note is the way the technique used in finding the quantum fidelity reveals the
actual measurement by which Eq. (1.14) is minimized. This measurement (i.e., orthonormal basis)
is the one specified by the Hermitian operator

$$M = \rho_1^{-1/2} \sqrt{\rho_1^{1/2} \rho_0\, \rho_1^{1/2}}\; \rho_1^{-1/2} \tag{1.16}$$

when $\rho_1$ is invertible. This technique comes in very handy for the problems considered in Chapter
4.

Kullback-Leibler Relative Information 

The final information theoretic measure of distinguishability we consider can be specified by the
following problem (see Section 3.6). Suppose $N \gg 1$ copies of a quantum system are all prepared in
the same state $\rho_1$. If some observable is measured on each of these, the most likely frequencies for
its various outcomes $b$ will be those given by the probabilities $p_1(b)$ assigned by quantum theory. All
other frequencies beside this "natural" set will become less and less likely for large $N$ as statistical
fluctuations in the frequencies eventually damp away. In fact, any set of outcome frequencies
$\{f(b)\}$, distinct from the "natural" ones $\{p_1(b)\}$, will become exponentially less likely with the
number of measurements according to [10]

$$\mathrm{PROB}\big(\mathrm{freq} = \{f(b)\} \,\big|\, \mathrm{prob} = \{p_1(b)\}\big) \approx e^{-N K(f/p_1)} , \tag{1.17}$$

where

$$K(f/p_1) = \sum_b f(b) \ln\!\left(\frac{f(b)}{p_1(b)}\right) \tag{1.18}$$

is the Kullback-Leibler relative information [11] between the frequency distribution $f(b)$ and the
probability distribution $p_1(b)$. Therefore the quantity $K(f/p_1)$, which controls the behavior of
this exponential decline, says something about how dissimilar the frequencies $\{f(b)\}$ are from the
"natural" ones $\{p_1(b)\}$. This gives an easy way to gauge the distinguishability of two probability
distributions, as will be seen presently.
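
The exponential decline in Eq. (1.17) can be checked against an exact count in the two-outcome case. The following Python sketch compares the Kullback-Leibler relative information with the observed decay rate of the exact binomial probability for a fixed "unnatural" frequency as the number of trials grows. The example probabilities, the trial counts, and the function names are assumptions made for illustration.

```python
import math

def relative_information(f, p):
    """K(f/p) with natural logarithm; assumes p(b) > 0 wherever f(b) > 0."""
    return sum(fb * math.log(fb / pb) for fb, pb in zip(f, p) if fb > 0)

def decay_rate(n_trials, k_first, p_first):
    """-(1/N) ln of the exact probability that the first outcome occurs
    exactly k_first times in n_trials independent trials."""
    log_comb = (math.lgamma(n_trials + 1) - math.lgamma(k_first + 1)
                - math.lgamma(n_trials - k_first + 1))
    log_prob = (log_comb + k_first * math.log(p_first)
                + (n_trials - k_first) * math.log(1.0 - p_first))
    return -log_prob / n_trials

p1 = [0.5, 0.5]     # assumed "natural" outcome probabilities
f = [0.75, 0.25]    # assumed observed frequencies

k_value = relative_information(f, p1)
for n in (40, 400, 4000):
    rate = decay_rate(n, 3 * n // 4, p1[0])
    print(f"N = {n:5d}: rate = {rate:.5f}, K(f/p1) = {k_value:.5f}")
```

The rate approaches $K(f/p_1)$ from above as $N$ grows; the remaining gap comes from a prefactor that is only polynomial in $N$ and so does not affect the exponent.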

Suppose now that the same measurements as above are performed on quantum systems all
prepared in the state $\rho_0$. The outcome frequencies most likely to appear in this scenario are
again those specified by the probability distribution given by quantum theory, in this case $p_0(b)$.
This simple fact points to a natural way to define an optimal measurement for this problem.
An optimal measurement is one for which the natural frequencies of outcomes for state $\rho_0$ are
maximally improbable, given that $\rho_1$ is actually controlling the statistics. That is to say, an optimal
measurement is one for which

$$K(p_0/p_1) = \sum_b p_0(b) \ln\!\left(\frac{p_0(b)}{p_1(b)}\right) \tag{1.19}$$

is as large as it can be. The associated quantum measure of distinguishability, called the quantum
Kullback information, is just that maximal value.
The quantum Kullback information is much like the accessible information in that it contains a
nasty logarithm in its definition. As such we have only been able to find bounds on this quantity.
Two of the lower bounds take on particularly pretty forms:

$$K_F(\rho_0/\rho_1) = \mathrm{tr}\big(\rho_0 \ln \mathcal{L}_{\rho_1}(\rho_0)\big) , \tag{1.20}$$

and

$$K_B(\rho_0/\rho_1) \equiv 2\, \mathrm{tr}\Big(\rho_0 \ln\big(\rho_1^{-1/2} \sqrt{\rho_1^{1/2} \rho_0\, \rho_1^{1/2}}\; \rho_1^{-1/2}\big)\Big) . \tag{1.21}$$

A very general procedure has also been found for generating successively tighter lower bounds than 
these. This procedure is, however, contingent upon one's ability to solve higher and higher order 
nonlinear matrix equations. Several upper bounds to the quantum Kullback information are also 
reported. 

Inference Disturbance Tradeoff for Quantum States 

With some results on the quantum distinguishability measures in hand, we turn our attention 
to the sorts of problems in which they may be used. A typical one is that already described 
in the Introduction: how might one gauge the necessary tradeoff between information gain and 
disturbance in quantum measurement? Not many results have yet poured from this direction, but 
we are able to go a long way toward defining the problem in its most general setting. The main 
breakthrough is in realizing a way to express the problem so that it becomes as algebraic as that 
of finding explicit formulae for the quantum distinguishability measures. 

The No-Broadcasting Theorem

Suppose a quantum system, secretly prepared in one state from the set $\mathcal{A} = \{\rho_0, \rho_1\}$, is dropped
into a "black box" whose purpose is to broadcast or replicate that quantum state onto two separate 
quantum systems. That is to say, a state identical to the original should appear in each system 
when it is considered without regard to the other (though there may be correlation or quantum 
entanglement between the systems). Can such a black box be built? 

The "no-cloning theorem" ensures that the answer to this question is no when the states
in $\mathcal{A}$ are pure and nonorthogonal; for the only way to have each of the broadcast systems described
separately by a pure state $|\psi\rangle$ is for their joint state to be $|\psi\rangle \otimes |\psi\rangle$. When the states are mixed,
however, things are not so clear. There are many ways each broadcast system can be described by
$\rho$ without the joint state being $\rho \otimes \rho$, the mixed state analog of cloning. The systems may also be
entangled or correlated in such a way as to give the correct marginal density operators.

For instance, consider the limiting case in which $\rho_0$ and $\rho_1$ commute and so may be thought of
as probability distributions $p_0(b)$ and $p_1(b)$ for their eigenvectors. In this case, one easily sees that
the states can be broadcast; the broadcasting device need merely perform a measurement of the
eigenbasis and prepare two systems, each in the state corresponding to the outcome it finds. The
resulting joint state is not of the form $\rho \otimes \rho$ but still reproduces the correct marginal probability
distributions and thus, in this case, the correct marginal density operators.

It turns out that two states $\rho_0$ and $\rho_1$ can be broadcast if and only if they commute.* The
way this is demonstrated is via a use of the quantum fidelity derived in Chapter 3. One can show
that the most general process or "black box" allowed by quantum mechanics can never increase
the distinguishability of quantum states (as measured by fidelity); yet broadcasting requires that
distinguishability actually increase unless the quantum states commute.

* This finding represents a collaboration with H. Barnum, C. M. Caves, R. Jozsa, and B. Schumacher.

This theorem is important because it draws a communication theoretic distinction between 
commuting and noncommuting density operators. This is a distinction that has only shown up 
before in the Holevo bound to accessible information: the bound is achieved by the accessible infor- 
mation if and only if the signal states commute. The no-broadcasting result also has implications 
for quantum cryptography. 



1.3 Essentials 

What are probability distributions? What are quantum systems? What are quantum states? 
What are quantum measurements? We certainly cannot answer these questions in full in this short 
treatise. However we need at least working definitions for these concepts to make any progress 
at all in our endeavors. This Section details some basic ideas and formalism used throughout the 
remainder of the dissertation. Also it lays the groundwork for some of the ideas presented in the 
Postscript. 



1.3.1 What Is a Probability? 



In this document, we hold fast to the Bayesian view of probability [14]. This is that a probability
assignment summarizes what one does and does not know about a particular situation. A
probability represents a state of knowledge; its numerical value quantifies the plausibility one is 
willing to give a hypothesis given some background information. 

This point of view should be contrasted with the idea that probability must be identified with 
the relative frequency of the various outcomes in an infinite repetition of a given experiment. The
difficulties with the frequency theory of probabilities are numerous [15] and need not be repeated
here. Suffice it to point out that if one takes the frequency idea seriously then one may never apply
the probability calculus to situations where an experiment cannot — by definition — be repeated more 
than once. For instance, I say that the probability that my heart will stop beating at 10:29 this 
evening is about one in a million. There can be no real infinite ensemble of repetitions for this 
experiment. Alternatively, if I must construct an "imaginary" conceptual ensemble to understand 
the probability statement, then why bother? Why not call it a degree of belief to begin with? 
The Bayesian point of view should also be contrasted with the propensity theory of probability
[16]. This is the idea that a probability expresses an objective tendency on the part of the
experimental situation to produce one outcome over the other. If this were the case, then one could
hardly apply the probability calculus to situations where the experimental outcome already exists 
at the time the experiment is performed. For instance, I give it a 98% chance there will be a
typographical error in this manuscript the night I defend it. But surely there either will or will not 
be such an error on the appropriate night, independently of the probability I have assigned. 

One might argue that probabilities in quantum mechanics are something different from the 
Bayesian sort [17, 18, 19], perhaps more aligned with the propensity idea [20, 21]. This idea is
taken to task in Ref. [^ . The argument in a nutshell is that if it looks like a Bayesian probability 
and smells like a Bayesian probability, then why draw a distinction where none is to be found?

With all that said and done, what is a probability? Formally, it is the plausibility $P(H|S)$ of a
hypothesis $H$ given some background information $S$ and satisfies the probability calculus:

• $P(H|S) \ge 0$,

• $P(H|S) + P(\neg H|S) = 1$, where $\neg H$ signifies the negation of $H$, and

• plausibilities are updated according to Bayes' rule upon the acquisition of new evidence $E$,

$$P(H|E, S) = \frac{P(H|S)\, P(E|H, S)}{P(E|S)} , \tag{1.22}$$

when it is clear that the background information is not changed by the process of acquiring
the evidence.
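
The updating rule in the last item can be sketched in a few lines of Python. As an illustration, the example reuses the photon scenario of Section 1.1 with assumed numbers: equal priors for the two polarizations, a crystal aligned with the first polarization, and a 45 degree angle between the two, so that the likelihoods of "straight through" are 1 and 1/2. The function name `bayes_update` is hypothetical.

```python
def bayes_update(priors, likelihoods):
    """Posterior P(H|E,S) for each hypothesis, given priors P(H|S)
    and likelihoods P(E|H,S) of the observed evidence E."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))   # P(E|S)
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

priors = [0.5, 0.5]        # assumed prior plausibilities of the two preparations
likelihoods = [1.0, 0.5]   # assumed probabilities of passing straight through

posterior = bayes_update(priors, likelihoods)
print(posterior)
```

Seeing the photon pass straight through shifts plausibility toward the aligned polarization without ever making it certain, which is the quantitative form of the "guess" discussed in the Introduction.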

These properties are essential for the work presented here and will be used over and over again.

1.3.2 What Is a Quantum State?

We take a very pragmatic approach to the meaning of a quantum state in this dissertation. Quantum 
states are the formal devices we use to describe what we do and do not know about a given
quantum system, nothing more, nothing less [24, 25]. In this sense, quantum states are
quite analogous to the Bayesian concept of probability outlined above. (Though more strictly
speaking, they form the background information S that may be used in a Bayesian probability 
assignment.) 

What is a quantum system? To say it once evocatively, a line drawn in the sand. A quantum 
system is any part of the physical world that we may wish to conceptually set aside for consideration. 
It need not be microscopic and it certainly need not be considered in complete isolation from the 
remainder of our conceptual world — it may interact with an environment. 

Mathematically speaking, in this dissertation what we call a quantum state is any density
operator $\rho$ over a D-dimensional Hilbert space, where D is finite. The properties of a density
operator are that it be Hermitian with nonnegative eigenvalues and that its trace be equal to unity.
A very special class of density operators are the one-dimensional projectors $\rho = |\psi\rangle\langle\psi|$. These
are called pure states as opposed to density operators of higher rank, which are often called mixed
states.



Pure states correspond to states of maximal knowledge in quantum mechanics [22]. Mixed
states, on the other hand, correspond to less than maximal knowledge. This can come about in
at least two ways. The first is simply by not knowing, to the extent that one could in principle,
the precise preparation of a quantum system. The second is in having maximal knowledge about a
composite quantum system, i.e., describing it via a pure state $|\psi\rangle\langle\psi|$ on some tensor-product Hilbert
space $\mathcal{H}_1 \otimes \mathcal{H}_2$, but restricting one's attention completely to a subsystem of the larger system:
quantum theory generally requires that one's knowledge of a subsystem be less than maximal even
though maximal knowledge has been attained concerning the composite system. Formally, the
states corresponding to the subsystems are given by tracing out the irrelevant Hilbert space. That
is to say, if $|1_i\rangle|2_\alpha\rangle$ is a basis for $\mathcal{H}_1 \otimes \mathcal{H}_2$, and

$$|\psi\rangle = \sum_{i\alpha} c_{i\alpha}\,|1_i\rangle|2_\alpha\rangle , \tag{1.23}$$

then the state on subsystem 1 is

$$\rho_1 = \mathrm{tr}_2\big(|\psi\rangle\langle\psi|\big) = \sum_{ij\alpha} c_{i\alpha}\,c_{j\alpha}^{*}\,|1_i\rangle\langle 1_j| . \tag{1.24}$$

A similar mixed state comes about when attention is restricted to subsystem 2. Calling the reduced
density operator the quantum state of the subsystem ensures that the form of the probability
formula for measurement outcomes will remain the same in going from composite system to
subsystem.
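
The partial trace of Eq. (1.24) can be checked numerically. The sketch below assumes NumPy and stores the coefficients $c_{i\alpha}$ as a matrix, so that the reduced state is a matrix product; the helper name `reduced_state_1` and the two example states are illustrative choices.

```python
import numpy as np

def reduced_state_1(c):
    """Return rho_1 = tr_2(|psi><psi|) for |psi> = sum_{i,a} c[i, a] |1_i>|2_a>."""
    c = np.asarray(c, dtype=complex)
    norm = np.linalg.norm(c)
    if norm == 0:
        raise ValueError("state vector must be nonzero")
    c = c / norm                  # enforce <psi|psi> = 1
    return c @ c.conj().T         # (rho_1)_{ij} = sum_a c[i, a] * conj(c[j, a])

# Maximally entangled two-qubit state (|00> + |11>)/sqrt(2): pure overall, mixed locally.
bell = np.array([[1.0, 0.0], [0.0, 1.0]]) / np.sqrt(2.0)
rho_bell = reduced_state_1(bell)
purity_bell = np.trace(rho_bell @ rho_bell).real

# Product state |0>|0>: the subsystem state stays pure.
product_state = np.array([[1.0, 0.0], [0.0, 0.0]])
rho_prod = reduced_state_1(product_state)
purity_prod = np.trace(rho_prod @ rho_prod).real
```

The purity $\mathrm{tr}(\rho_1^2)$ equals one only for pure states, so it gives a quick way to see that knowledge of the subsystem is less than maximal in the entangled case.
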



1.3.3 What Is a Quantum Measurement? 

We shall be concerned with a very general notion of quantum measurement throughout this dis- 
sertation, and it will serve us well to make the ideas plain at the outset. We take as a quantum 
measurement any physical process that can be used on a quantum system to generate a probability 
distribution for some "outcomes." 

To make this idea rigorous, we recall two standard axioms for quantum theory. The first is
that when the conditions and environment of a quantum system are completely specified and no
measurements are being made, then the system's state evolves according to the action of a unitary
operator $U$,

$$\rho \longrightarrow U \rho\, U^\dagger . \tag{1.25}$$

The second is that a repeatable measurement corresponds to some complete set of projectors $\Pi_b$
(not necessarily one-dimensional) onto orthogonal subspaces of the quantum system's Hilbert space.
The probabilities of the outcomes for such a measurement are given according to the von Neumann
formula,

$$p(b) = \mathrm{tr}(\rho\,\Pi_b) . \tag{1.26}$$

With these two basic facts, we can lay out the structure of a general quantum measurement. 

The most general action that can be performed on a quantum system to generate a set of
"outcomes" is |Q 



1. to allow the system to be placed in contact with an auxiliary system or ancilla [27] prepared
in a standard state, 

2. to have the two evolve unitarily so as to become correlated or entangled, and then 

3. to perform a repeatable measurement on the ancilla. 

One might have thought that a measurement on the composite system as a whole — in the last stage 
of this — could lead to a more general set of measurements, but, in fact, it cannot. For this can 
always be accomplished in this scheme by a unitary operation that first swaps the system's state 
into a subspace of the ancilla's Hilbert space and then proceeds as above. 

More formally, these steps give rise to a probability distribution in the following way. Suppose
the system and ancilla are initially described by the density operators $\rho_s$ and $\rho_a$ respectively. The
conjunction of the two systems is then described by the initial quantum state

$$\rho_{sa} = \rho_s \otimes \rho_a . \tag{1.27}$$

Then the unitary time evolution leads to a new state,

$$\rho_{sa} \longrightarrow U \rho_{sa} U^\dagger . \tag{1.28}$$

Finally, a reproducible measurement on the ancilla is described via a set of orthogonal projection
operators $\{\hat{1} \otimes \Pi_b\}$ acting on the ancilla's Hilbert space, where $\hat{1}$ is the identity operator. Any
particular outcome $b$ is found with probability

$$p(b) = \mathrm{tr}\big(U(\rho_s \otimes \rho_a)U^\dagger(\hat{1} \otimes \Pi_b)\big) . \tag{1.29}$$

The number of outcomes in this generalized notion of measurement is limited only by the
dimensionality of the ancilla's Hilbert space; in principle, there can be arbitrarily many. There are no
limitations on the number of outcomes due to the dimensionality of the system's Hilbert space.

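It turns out that this formula for the probabilities can be reexpressed in terms of operators on
the system's Hilbert space alone. This is easy to see. If we let $|s_\alpha\rangle$ and $|a_c\rangle$ be orthonormal
bases for the system and ancilla respectively, $|s_\alpha\rangle|a_c\rangle$ will be a basis for the composite system. Then
using the cyclic property of the trace in Eq. (1.29), we get

\begin{align}
p(b) &= \sum_{\alpha c} \langle s_\alpha|\langle a_c| \Big( (\rho_s \otimes \rho_a)\, U^\dagger (\hat{1} \otimes \Pi_b)\, U \Big) |s_\alpha\rangle|a_c\rangle \nonumber \\
&= \sum_{\alpha} \langle s_\alpha|\, \rho_s \Big( \sum_{c} \langle a_c| \big( (\hat{1} \otimes \rho_a)\, U^\dagger (\hat{1} \otimes \Pi_b)\, U \big) |a_c\rangle \Big) |s_\alpha\rangle . \tag{1.30}
\end{align}

It follows that we may write

$$p(b) = \mathrm{tr}_s(\rho_s E_b) , \tag{1.31}$$

where

$$E_b = \mathrm{tr}_a\big( (\hat{1} \otimes \rho_a)\, U^\dagger (\hat{1} \otimes \Pi_b)\, U \big) \tag{1.32}$$

is an operator that acts on the Hilbert space of the original system only. Here $\mathrm{tr}_s$ and $\mathrm{tr}_a$ denote
partial traces over the system and ancilla Hilbert spaces, respectively.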

Note that the $E_b$ are positive operators, i.e., Hermitian operators with nonnegative eigenvalues,
usually denoted

$$E_b \ge 0 , \tag{1.33}$$

because they are formed from the partial trace of the product of two positive operators. Moreover,
these automatically satisfy a completeness relation of a sort,

$$\sum_b E_b = \hat{1} . \tag{1.34}$$

These two properties taken together are the defining properties of something called a Positive
Operator-Valued Measure or POVM. Sets of operators $\{E_b\}$ satisfying this are so called because
they give an obvious (mathematical) generalization of the probability concept. As opposed to a
complete set of orthogonal projectors, the POVM elements $E_b$ need not commute with each other.
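
The three-step construction above can be sketched numerically. The example below assumes NumPy, a qubit system, a three-level ancilla prepared in a fixed basis state, a randomly drawn unitary, and rank-one projectors on the ancilla; all of these are illustrative choices rather than anything prescribed by the text. It builds the operators $E_b = \mathrm{tr}_a\big((\hat{1}\otimes\rho_a)U^\dagger(\hat{1}\otimes\Pi_b)U\big)$ and compares the two ways of computing the outcome probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_a = 2, 3                      # system and ancilla dimensions (illustrative)

def random_unitary(dim, rng):
    """Unitary obtained from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(z)
    return q

def partial_trace_ancilla(op, d_s, d_a):
    """Trace out the second (ancilla) factor of an operator on H_s (x) H_a."""
    return np.einsum("iaja->ij", op.reshape(d_s, d_a, d_s, d_a))

U = random_unitary(d_s * d_a, rng)
id_s = np.eye(d_s)
rho_a = np.zeros((d_a, d_a), dtype=complex)
rho_a[0, 0] = 1.0                    # ancilla in the standard state |0><0|

# Rank-one projectors Pi_b = |b><b| on the ancilla.
projectors = []
for b in range(d_a):
    proj = np.zeros((d_a, d_a), dtype=complex)
    proj[b, b] = 1.0
    projectors.append(proj)

# POVM elements on the system alone.
povm = []
for proj in projectors:
    heisenberg = U.conj().T @ np.kron(id_s, proj) @ U
    povm.append(partial_trace_ancilla(np.kron(id_s, rho_a) @ heisenberg, d_s, d_a))

# A random pure system state.
v = rng.normal(size=d_s) + 1j * rng.normal(size=d_s)
rho_s = np.outer(v, v.conj()) / np.vdot(v, v).real

# Probabilities computed on the composite system and via the POVM.
p_composite = [
    np.trace(U @ np.kron(rho_s, rho_a) @ U.conj().T @ np.kron(id_s, proj)).real
    for proj in projectors
]
p_povm = [np.trace(rho_s @ E).real for E in povm]
```

With a three-dimensional ancilla the qubit acquires a three-outcome measurement, which illustrates that the number of outcomes is set by the ancilla and not by the system.
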

A theorem, originally due to Neumark [28], that is very useful for our purposes is that any
POVM can be realized in the fashion of Eq. (1.32). This allows us to make full use of the defining
properties of POVMs in setting up the optimization problems considered here. Namely, whenever
we wish to optimize something like the mutual information, say, over all possible quantum
measurements, we need only write the expression in terms of a POVM $\{E_b\}$ and optimize over all operator
sets satisfying Eqs. (1.33) and (1.34).



Why need we go to such lengths to describe a more general notion of measurement than the
standard textbook one? The answer is simple: because there are situations in which the repeatable
measurements, i.e., the orthogonal projection-valued measurements $\{\Pi_b\}$, are just not general
enough to give the optimal measurement. The paradigmatic example of such a thing [29] is that of
a quantum system with a 2-dimensional Hilbert space (a spin-1/2 particle, say) that can be prepared
in one of three pure states, all with equal prior probabilities. If the possible states are each 120°
apart on the Bloch sphere, then the measurement that optimizes the information recoverable about
the quantum state's identity is one with three outcomes. Each outcome excludes one of the
possibilities, narrowing down the state's identity to one of the other two. Therefore this measurement
cannot be described by a standard two-outcome orthogonal projection-valued measurement.
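
A three-outcome measurement with the exclusion property just described can be written down explicitly. The sketch below assumes NumPy, places the three Bloch vectors in the x-z plane, and takes each POVM element proportional to the projector orthogonal to one of the states, $E_b = \tfrac{1}{3}(\hat{1} - \vec{n}_b\cdot\vec{\sigma})$. This particular form is an assumption of the sketch; the text only states that such a measurement exists.

```python
import numpy as np

sigma_x = np.array([[0, 1], [1, 0]], dtype=complex)
sigma_z = np.array([[1, 0], [0, -1]], dtype=complex)
identity = np.eye(2, dtype=complex)

# Three Bloch vectors 120 degrees apart in the x-z plane.
angles = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
bloch = [(np.sin(t), np.cos(t)) for t in angles]

# Pure states rho_c = (1 + n_c . sigma)/2 and elements E_b = (1 - n_b . sigma)/3.
states = [0.5 * (identity + nx * sigma_x + nz * sigma_z) for nx, nz in bloch]
povm = [(identity - nx * sigma_x - nz * sigma_z) / 3 for nx, nz in bloch]

# cond[c, b] = probability of outcome b when state c was prepared.
cond = np.array([[np.trace(rho @ E).real for E in povm] for rho in states])

# Mutual information (in bits) between preparation and outcome for equal priors.
prior = 1.0 / 3.0
joint = cond * prior
p_out = joint.sum(axis=0)
mutual_info = sum(
    joint[c, b] * np.log2(joint[c, b] / (prior * p_out[b]))
    for c in range(3) for b in range(3) if joint[c, b] > 1e-12
)
```

Outcome $b$ never occurs when state $b$ was prepared and occurs with probability one half for each of the other two states, so each outcome leaves exactly two equally likely candidates.
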
Chapter 2



The Distinguishability of Probability 
Distributions 

"... and as these are the only ties of our 
thoughts, they are really to us the cement 
of the universe, and all the operations of 
the mind must, in a great measure, depend 
on them." 

— David Hume 
An Abstract of a 
Treatise of Human Nature 

2.1 Introduction 

The idea of distinguishing probability distributions is slippery business. What one means in say- 
ing "two probability distributions are distinguishable" depends crucially upon one's prior state 
of knowledge and the context in which the probabilities are applied. This chapter is devoted to 
developing three quantitative measures of distinguishability, each tailor-made to a distinct problem. 

To make firm what we strive for in developing these measures, one should keep at least one or the
other of two models in mind. The first, very concrete, one is the model of a noisy communication
channel. In this model, things are very simple. A sender prepares a simple message, 0 or 1,
encoded in distinct preparations of a physical system, with probabilities $\pi_0$ and $\pi_1$. A receiver
performs some measurement on the system he receives for which the measurement outcomes $b$ have
conditional probability $p_0(b)$ or $p_1(b)$, depending upon the preparation of the system. The idea
is that a notion of distinguishability of probability distributions should tell us something about
what options are available to the receiver once he collects his data: what sort of inferences or
estimates the receiver may make about the sender's actual preparation. The more distinguishable
the probability distributions, the more the receiver's data should give him some insight (in senses
to be defined) about the actual preparation.

The second, more abstract, model is that of the "quantum-information channel" [30, 31],
though at this level applied within a purely classical context. Here we start with a physical system
that we describe via some probability distribution $p_0(b)$, and we wish to transpose that state of
knowledge onto another physical system in the possession of someone with whom we may later
communicate. This transposition process must by necessity be carried out by some physical means,
even if only by actually transporting our system to the other person. The only difference between
this and a standard communication transaction is that no mention will be made as to what the
receiver of the "signal" will actually find if he makes an observation on it. Rather we shall be
concerned with what we, the senders, can predict about what he would find in an observational
situation. This distinction is crucial, for it leads to the following point.

If the state of knowledge is left truly intact during the transposition process, then the system
possessed by the receiver will be described by the sender via the initial probability distribution
$p_0(b)$. If on the other hand (through lack of absolute control of the physical systems concerned or
through the intervention of a third party) the state of knowledge changes during the transposition
attempt, the receiver's system will have to be described by some different distribution $p_1(b)$. We
might even imagine a certain probability $\pi_0$ that the transposition will be faithful and a probability
$\pi_1$ that it will be imperfect. The question we should like to address is to what extent the states of
knowledge $p_0(b)$ and $p_1(b)$ can be distinguished one from the other given these circumstances. In
what quantifiable and operational sense can one state be said to be distinct from the other?

This language, it should be noted, makes no mention of probability distributions being either
true or false. Neither $p_0(b)$ nor $p_1(b)$ need have anything to do with the receiver's subjective
expectation for his observational outcomes; the focus here is completely upon the distinction in
predictions the sender may make under the opposing circumstances and how well he himself might
be able to check that something did or did not go amiss during the transposition process.

In the following chapters these measures will be applied to the quantum mechanical context,
where states of knowledge are more completely represented by density operators $\rho_0$ and $\rho_1$. For
the time being, however, it may be useful (though not necessary) to think of the distributions as
arising from some fixed measurement POVM $\{E_b\}$ via

$$p_0(b) = \mathrm{tr}(\rho_0 E_b) \quad\text{and}\quad p_1(b) = \mathrm{tr}(\rho_1 E_b) . \tag{2.1}$$

This representation, of course, will be the starting point for the quantum mechanical considerations
of the next chapter.

2.2 Distinguishability via Error Probability and the Chernoff Bound 

Perhaps the simplest way to build a notion of distinguishability for probability distributions is
through a simple decision problem. In this problem, an observer blindly samples once from either
the distribution $p_0(b)$ or the distribution $p_1(b)$, $b = 1, \ldots, n$; at most he might know prior
probabilities $\pi_0$ and $\pi_1$ for which distribution he samples. If the distributions are distinct, the outcome of the
sampling will reveal something about the identity of the distribution from which it was drawn. An
easy way to quantify how distinct the distributions are comes from imagining that the observer must
make a guess or inference about the identity after drawing his sample. The observer's best-possible
probability of error in this game says something about the distinguishability of the distributions
$p_0(b)$ and $p_1(b)$ with respect to his prior state of knowledge (as encapsulated in the distribution
$\{\pi_0, \pi_1\}$). This idea we develop as our first quantitative measure of distinguishability. Afterward
we generalize it to the possibility of many samplings from the same distribution. This gives rise to
a measure of distinguishability associated with the exponential decrease of error probability with
the number of samplings, the Chernoff bound.

2.2.1 Probability of Error 

What is the best possible probability of error for the decision problem? This can be answered easily
enough by manipulating the formal definition of error probability. Let us work at this immediately.
A decision function is any function

$$\delta : \{1, \ldots, n\} \to \{0, 1\} \tag{2.2}$$

representing the method of guess an observer might use in this problem. The probability that such
a guess will be in error is

$$P_e(\delta) = \pi_0\, P(\delta = 1 \,|\, 0) + \pi_1\, P(\delta = 0 \,|\, 1) , \tag{2.3}$$

where $P(\delta = 1 \,|\, 0)$ denotes the probability that the guess is $p_1(b)$ when, in fact, the distribution
drawn from is really $p_0(b)$. Similarly $P(\delta = 0 \,|\, 1)$ denotes the probability that the guess is $p_0(b)$
when, in fact, the distribution drawn from is really $p_1(b)$.

A natural decision function is the one such that 0 or 1 is chosen according to which has the
highest posterior probability, given the sampling's outcome $b$. Since the posterior probabilities are
given by Bayes' Rule,

$$p(i|b) = \frac{\pi_i\, p_i(b)}{p(b)} = \frac{\pi_i\, p_i(b)}{\pi_0 p_0(b) + \pi_1 p_1(b)} , \tag{2.4}$$

where $i = 0, 1$ and

$$p(b) = \pi_0 p_0(b) + \pi_1 p_1(b) \tag{2.5}$$

is the total probability for outcome $b$, this decision function is called Bayes' decision function.
Symbolically, Bayes' decision function translates into

$$\delta_B(b) = \begin{cases} 0 & \text{if } \pi_0 p_0(b) > \pi_1 p_1(b) \\ 1 & \text{if } \pi_1 p_1(b) > \pi_0 p_0(b) \\ \text{anything} & \text{if } \pi_0 p_0(b) = \pi_1 p_1(b) \end{cases} \tag{2.6}$$

(When the posterior probabilities are equal, it makes no difference which guessing method is used.)
It turns out, not unexpectedly, that this decision method is optimal as far as error probability is
concerned [32]. This is seen easily. (In this Chapter, we denote the beginning and ending of proofs
by △ and □, respectively.)

△ Note that for any decision procedure $\delta$, Eq. (2.3) can be rewritten as

$$P_e(\delta) = \pi_0 \sum_{b=1}^{n} \delta(b)\, p_0(b) + \pi_1 \sum_{b=1}^{n} [1 - \delta(b)]\, p_1(b) , \tag{2.7}$$

because $\sum_b \delta(b)\, p_0(b)$ is the total probability of guessing 1 when the answer is 0 and $\sum_b [1 - \delta(b)]\, p_1(b)$
is the total probability of guessing 0 when the answer is 1. Then it follows that

$$P_e(\delta) - P_e(\delta_B) = \sum_{b=1}^{n} \big(\delta(b) - \delta_B(b)\big)\big(\pi_0 p_0(b) - \pi_1 p_1(b)\big) . \tag{2.8}$$

Suppose $\delta \ne \delta_B$. Then the only nonzero terms in this sum occur when $\delta(b) \ne \delta_B(b)$ and
$\pi_0 p_0(b) \ne \pi_1 p_1(b)$. Let us consider these terms. When $\delta(b) = 0$ and $\delta_B(b) = 1$, $\pi_0 p_0(b) - \pi_1 p_1(b) < 0$; thus
the term in the sum is positive. When $\delta(b) = 1$ and $\delta_B(b) = 0$, we have $\pi_0 p_0(b) - \pi_1 p_1(b) > 0$, and
again the term in the sum is positive. Therefore it follows that

$$P_e(\delta) \ge P_e(\delta_B) , \tag{2.9}$$

for any decision function $\delta$ other than Bayes'. This proves Bayes' decision function to be optimal.
□
We shall hereafter denote the probability of error with respect to Bayes' decision method simply
by $P_e$. This quantity is expressed more directly by noticing that, when the outcome $b$ is found, the
probability of a correct decision is just $\max\{p(0|b), p(1|b)\}$. Therefore

\begin{align}
P_e &= \sum_{b=1}^{n} p(b)\big(1 - \max\{p(0|b), p(1|b)\}\big) \nonumber \\
&= \sum_{b=1}^{n} p(b)\, \min\{p(0|b), p(1|b)\} \tag{2.10} \\
&= \sum_{b=1}^{n} \min\{\pi_0 p_0(b), \pi_1 p_1(b)\} . \tag{2.11}
\end{align}

Eq. (2.10) follows because, for any $b$, $p(0|b) + p(1|b) = 1$. Equation (2.11) gives a concrete expression
for the sought after measure of distinguishability: the smaller this expression is numerically, the
more distinguishable the two distributions are.
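
The optimality of Bayes' decision function can be confirmed by brute force for a small example. The sketch below enumerates every decision function on three outcomes, evaluates Eq. (2.7) for each, and compares the best value with the closed form of Eq. (2.11). The distributions and priors are hypothetical values chosen only for illustration.

```python
from itertools import product

def error_probability(decision, p0, p1, pi0, pi1):
    """Eq. (2.7): error probability of a decision function given as a tuple of 0/1 guesses."""
    return sum(
        pi0 * d * a + pi1 * (1 - d) * b
        for d, a, b in zip(decision, p0, p1)
    )

def bayes_error(p0, p1, pi0, pi1):
    """Eq. (2.11): sum over outcomes of min{pi0 * p0(b), pi1 * p1(b)}."""
    return sum(min(pi0 * a, pi1 * b) for a, b in zip(p0, p1))

# Hypothetical distributions over three outcomes and unequal priors.
p0 = [0.5, 0.3, 0.2]
p1 = [0.1, 0.3, 0.6]
pi0, pi1 = 0.4, 0.6

# Exhaustive search over all 2^n decision functions.
all_errors = [
    error_probability(d, p0, p1, pi0, pi1)
    for d in product((0, 1), repeat=len(p0))
]
best_error = min(all_errors)
closed_form = bayes_error(p0, p1, pi0, pi1)
```
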



Notice that Eq. (2.11) depends explicitly on the observer's subjective prior state of knowledge
through $\pi_0$ and $\pi_1$ and is not solely a function of the probability distributions to be distinguished.
This dependence is neither a good thing nor a bad thing, for, after all, Bayesian probabilities 
are subjective notions to begin with: they are always defined with respect to someone's state of 
knowledge. One need only be aware of this extra dependence. 

The main point of interest for this measure of distinguishability is that it is operationally 
defined and can be written in terms of a fairly simple expression — one expressed in terms of the 
first power of the probabilities. However, one should ask a few simple immediate questions to 
test the robustness of this concept. For instance, why did we not consider two samplings before 
a decision was made? Indeed, why not three or four or more? If the error probabilities in these 
scenarios lead to nothing new and interesting, then one's work is done. On the other hand, if 
such cases lead to seemingly different measures of distinguishability, then the foundation of this 
approach might require examination. 

These questions are settled by an example due to Cover |Q . Consider the following four different 
probability distributions over two outcomes: $p_0 = \{.96, .04\}$, $p_1 = \{.04, .96\}$, $q_0 = \{.90, .10\}$, and
$q_1 = \{0, 1\}$. Let us compare the distinguishability of $p_0$ and $p_1$ via Eq. (2.11) to that of $q_0$ and $q_1$,
both under the assumption of equal prior probabilities:

$$P_e(p_0, p_1) = \tfrac{1}{2}\min\{.96, .04\} + \tfrac{1}{2}\min\{.04, .96\} = .04 , \tag{2.12}$$

and

$$P_e(q_0, q_1) = \tfrac{1}{2}\min\{.90, 0\} + \tfrac{1}{2}\min\{.10, 1\} = .05 . \tag{2.13}$$

Therefore

$$P_e(p_0, p_1) < P_e(q_0, q_1) , \tag{2.14}$$

and so, by this measure, the distributions $p_0$ and $p_1$ are more distinguishable from each other than
the distributions $q_0$ and $q_1$.

On the other hand, consider modifying the scenario so that two samples are taken before a
guess is made about the identity of the distribution. This scenario falls into the same framework
as before, only now there are four possible outcomes to the experiment. These must be taken into
account in calculating the Bayes' decision rule probability of error. Namely, in obvious notation,

\begin{align}
P_e(p_0^2, p_1^2) &= \tfrac{1}{2}\min\{.96 \times .96,\ .04 \times .04\} + \tfrac{1}{2}\min\{.96 \times .04,\ .04 \times .96\} \nonumber \\
&\quad + \tfrac{1}{2}\min\{.04 \times .96,\ .96 \times .04\} + \tfrac{1}{2}\min\{.04 \times .04,\ .96 \times .96\} \nonumber \\
&= .04 , \tag{2.15}
\end{align}

and

\begin{align}
P_e(q_0^2, q_1^2) &= \tfrac{1}{2}\min\{.90 \times .90,\ 0 \times 0\} + \tfrac{1}{2}\min\{.90 \times .10,\ 0 \times 1\} \nonumber \\
&\quad + \tfrac{1}{2}\min\{.10 \times .90,\ 1 \times 0\} + \tfrac{1}{2}\min\{.10 \times .10,\ 1 \times 1\} \nonumber \\
&= .005 . \tag{2.16}
\end{align}

Therefore

$$P_e(q_0^2, q_1^2) < P_e(p_0^2, p_1^2) . \tag{2.17}$$

The distributions $q_0$ and $q_1$ are actually more distinguishable from each other than the distributions
$p_0$ and $p_1$, when one allows two samplings into the decision problem.
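
The reversal in this example is easy to reproduce. The sketch below evaluates the Bayes error probability for any number of independent samples by enumerating all outcome strings; the function name `bayes_error_n` is an illustrative choice, and `math.prod` assumes Python 3.8 or later.

```python
from itertools import product
from math import prod

def bayes_error_n(p0, p1, n_samples, pi0=0.5, pi1=0.5):
    """Bayes error probability after n_samples independent draws from p0 or p1."""
    total = 0.0
    for seq in product(range(len(p0)), repeat=n_samples):
        like0 = prod(p0[b] for b in seq)   # probability of the string under p0
        like1 = prod(p1[b] for b in seq)   # probability of the string under p1
        total += min(pi0 * like0, pi1 * like1)
    return total

p0, p1 = [0.96, 0.04], [0.04, 0.96]
q0, q1 = [0.90, 0.10], [0.0, 1.0]

one_sample = (bayes_error_n(p0, p1, 1), bayes_error_n(q0, q1, 1))
two_samples = (bayes_error_n(p0, p1, 2), bayes_error_n(q0, q1, 2))
```

With one sample the pair $(p_0, p_1)$ has the smaller error probability; with two samples the ordering flips in favor of $(q_0, q_1)$.
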

This example suggests that the probability of error, though a perfectly fine measure of distin- 
guishability for the particular problem of decision-making after one sampling, still leaves something 
to be desired. Even though it is operationally defined, it is not a measure that adapts easily to fur- 
ther data acquisition. For this one needs a measure that is not explicitly tied to the exact number 
of samplings in the decision problem. 

2.2.2 The Chernoff Bound 

The optimal probability of error in the decision problem (the one given by using Bayes' decision
rule) must decrease toward zero as the number of samplings increases. This is intuitively clear.
The exact form of that decrease, however, may not be so obvious. It turns out that the decrease 
asymptotically approaches an exponential in the number of samples drawn before the decision. The
particular value of this exponential is called the Chernoff bound p4| , pq , IC] because it is not only 
achieved asymptotically, but also envelops the true decrease from above.

The Chernoff bound thus forms an attractive notion of distinguishability for probability dis- 
tributions: the faster two distributions allow the probability of error to decrease to zero in the 
number of samples, the more distinguishable the distributions are. It is operationally defined by 
being intimately tied to the decision problem. Yet it depends neither on the prior probabilities
$\pi_0$ and $\pi_1$ nor on the number of samples drawn before the decision. The formal statement of the
Chernoff bound is given in the following theorem. 

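Theorem 2.1 (Chernoff) Let $P_e(N)$ be the probability of error for Bayes' decision rule after
sampling $N$ times one of the two distributions $p_0(b)$ or $p_1(b)$. Then

$$P_e(N) \le \lambda^N , \tag{2.18}$$

where

$$\lambda = \min_{\alpha} F_\alpha(p_0/p_1) , \tag{2.19}$$

and

$$F_\alpha(p_0/p_1) = \sum_{b=1}^{n} p_0(b)^{\alpha}\, p_1(b)^{1-\alpha} , \tag{2.20}$$

for $\alpha$ restricted to be between 0 and 1. Moreover this bound is approached asymptotically in the
limit of large $N$.

△ Let us demonstrate the first part of this theorem. Denote the outcome of the $k$'th trial by
$b_k$. Then the two probability distributions for the outcomes of a string of $N$ trials can be written

$$p_0(b_1 b_2 \ldots b_N) = p_0(b_1)\, p_0(b_2) \cdots p_0(b_N) , \tag{2.21}$$

and

$$p_1(b_1 b_2 \ldots b_N) = p_1(b_1)\, p_1(b_2) \cdots p_1(b_N) . \tag{2.22}$$

Now note that, for any two positive numbers $a$ and $b$ and any $0 \le \alpha \le 1$,

$$\min\{a, b\} \le a^{\alpha} b^{1-\alpha} . \tag{2.23}$$

This follows easily. First suppose $a \le b$. Then, because $1 - \alpha \ge 0$, we know

$$\left(\frac{b}{a}\right)^{1-\alpha} \ge 1 . \tag{2.24}$$

So

$$\min\{a, b\} = a \le a \left(\frac{b}{a}\right)^{1-\alpha} = a^{\alpha} b^{1-\alpha} . \tag{2.25}$$

Alternatively, suppose $b \le a$; then

$$\left(\frac{a}{b}\right)^{\alpha} \ge 1 , \tag{2.26}$$

and

$$\min\{a, b\} = b \le b \left(\frac{a}{b}\right)^{\alpha} = a^{\alpha} b^{1-\alpha} . \tag{2.27}$$

Putting the notation and the small mathematical fact from the last paragraph together, we
obtain that for any $\alpha \in [0, 1]$,

\begin{align}
P_e(N) &= \sum_{b_1 b_2 \ldots b_N} \min\{\pi_0\, p_0(b_1 b_2 \ldots b_N),\ \pi_1\, p_1(b_1 b_2 \ldots b_N)\} \nonumber \\
&\le \pi_0^{\alpha} \pi_1^{1-\alpha} \sum_{b_1 b_2 \ldots b_N} p_0(b_1 b_2 \ldots b_N)^{\alpha}\, p_1(b_1 b_2 \ldots b_N)^{1-\alpha} \nonumber \\
&= \pi_0^{\alpha} \pi_1^{1-\alpha} \sum_{b_1 b_2 \ldots b_N} \left( \prod_{k=1}^{N} p_0(b_k)^{\alpha}\, p_1(b_k)^{1-\alpha} \right) \nonumber \\
&= \pi_0^{\alpha} \pi_1^{1-\alpha} \prod_{k=1}^{N} \left( \sum_{b_k=1}^{n} p_0(b_k)^{\alpha}\, p_1(b_k)^{1-\alpha} \right) \nonumber \\
&= \pi_0^{\alpha} \pi_1^{1-\alpha} \left( \sum_{b=1}^{n} p_0(b)^{\alpha}\, p_1(b)^{1-\alpha} \right)^{N} \nonumber \\
&\le \left( \sum_{b=1}^{n} p_0(b)^{\alpha}\, p_1(b)^{1-\alpha} \right)^{N} . \tag{2.28}
\end{align}

The tightest bound of this form is found by further minimizing the right hand side of Eq. (2.28) over
$\alpha$. This completes the proof that the Chernoff bound is indeed a bound on $P_e(N)$. The remaining
part of the theorem, that the bound is asymptotically achieved, is more difficult to demonstrate
and we shall not consider it here; see instead Ref. [10]. □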
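
The first part of the theorem can be checked numerically for a simple case. The sketch below approximates $\lambda$ by minimizing $F_\alpha$ over a uniform grid of $\alpha$ values (a crude stand-in for a proper one-dimensional minimizer, chosen here only for simplicity) and compares $\lambda^N$ with the exact Bayes error for a few values of $N$. Because every $\alpha$ in $[0, 1]$ yields a valid bound, the grid value still bounds $P_e(N)$ from above. The distributions are assumed strictly positive, and equal priors are assumed.

```python
from itertools import product
from math import prod, sqrt

def renyi_overlap(p0, p1, alpha):
    """F_alpha of Eq. (2.20), for strictly positive distributions."""
    return sum(a**alpha * b**(1.0 - alpha) for a, b in zip(p0, p1))

def chernoff_lambda(p0, p1, grid=2001):
    """Approximate (lambda, alpha*) by a grid search over alpha in [0, 1]."""
    alphas = [k / (grid - 1) for k in range(grid)]
    return min((renyi_overlap(p0, p1, a), a) for a in alphas)

def bayes_error_n(p0, p1, n_samples, pi0=0.5, pi1=0.5):
    """Exact Bayes error after n_samples independent draws, by enumeration."""
    total = 0.0
    for seq in product(range(len(p0)), repeat=n_samples):
        like0 = prod(p0[b] for b in seq)
        like1 = prod(p1[b] for b in seq)
        total += min(pi0 * like0, pi1 * like1)
    return total

p0, p1 = [0.96, 0.04], [0.04, 0.96]
lam, alpha_star = chernoff_lambda(p0, p1)
sample_counts = range(1, 7)
errors = [bayes_error_n(p0, p1, n) for n in sample_counts]
bounds = [lam**n for n in sample_counts]
```

For this symmetric pair the minimum falls at $\alpha = 1/2$, so $\lambda$ coincides with the overlap $\sum_b \sqrt{p_0(b)\,p_1(b)}$.
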
The value $\alpha^*$ of $\alpha$ that achieves the minimum in Eq. (2.19) generally cannot be expressed any
more explicitly than there. This is because to find it in general one must solve a transcendental
equation. A notable exception is when the probability distributions $p_0 = \{q, 1 - q\}$ and $p_1 =
\{r, 1 - r\}$ are distributions over two outcomes. Let us write out this special case. Here

$$F_\alpha(p_0, p_1) = q^{\alpha} r^{1-\alpha} + (1 - q)^{\alpha} (1 - r)^{1-\alpha} . \tag{2.29}$$

Setting the derivative of this (with respect to $\alpha$) equal to zero, we find that the optimal $\alpha$ must
satisfy

$$\left[ \frac{q(1-r)}{r(1-q)} \right]^{\alpha} = \frac{1-r}{r}\; \frac{\ln(1-r) - \ln(1-q)}{\ln q - \ln r} ; \tag{2.30}$$

hence

$$\alpha^* = \left[ \ln \frac{q(1-r)}{r(1-q)} \right]^{-1} \ln\!\left[ \frac{1-r}{r}\; \frac{\ln(1-r) - \ln(1-q)}{\ln q - \ln r} \right] . \tag{2.31}$$

With this and a lot of algebra, one can show

\begin{align}
\ln F_{\alpha^*}(p_0, p_1) &= -x \ln\frac{x}{q} - (1 - x) \ln\frac{1-x}{1-q} \nonumber \\
&= -x \ln\frac{x}{r} - (1 - x) \ln\frac{1-x}{1-r} , \tag{2.32}
\end{align}

where

$$x = \left[ \ln \frac{q(1-r)}{r(1-q)} \right]^{-1} \ln \frac{1-r}{1-q} . \tag{2.33}$$

That is to say, using an expression to be introduced in Section 2.3, the optimal $\ln F_\alpha(p_0, p_1)$ is
given (up to an overall sign) by the Kullback-Leibler relative information [0] between the distribution $p_0$ (or $p_1$) and the
"distribution" $\{x, 1 - x\}$.

This property, relating the Chernoff bound to a Kullback-Leibler relative information is more 
generally true and worth mentioning. The precise statement is the following theorem; details of its 
proof may be found in Ref. [p^ ]. 

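Theorem 2.2 (Chernoff) The constant $\lambda$ in the Chernoff bound can also be expressed as

$$-\ln \lambda = K(p_{\alpha^*}/p_0) = K(p_{\alpha^*}/p_1) , \tag{2.34}$$

where $K(p_\alpha/p_0)$ is the Kullback-Leibler relative information between the distributions $p_\alpha(b)$ and
$p_0(b)$,

$$K(p_\alpha/p_0) = \sum_{b=1}^{n} p_\alpha(b) \ln\!\left( \frac{p_\alpha(b)}{p_0(b)} \right) , \tag{2.35}$$

and the distribution $p_\alpha(b)$, depending upon the parameter $\alpha$, is defined by

$$p_\alpha(b) = \frac{p_0(b)^{\alpha}\, p_1(b)^{1-\alpha}}{\sum_{b'} p_0(b')^{\alpha}\, p_1(b')^{1-\alpha}} . \tag{2.36}$$

The particular value of $\alpha$ used in this, i.e., $\alpha^*$, is the one for which

$$K(p_{\alpha^*}/p_0) = K(p_{\alpha^*}/p_1) . \tag{2.37}$$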
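
For two-outcome distributions $p_0 = \{q, 1-q\}$ and $p_1 = \{r, 1-r\}$ the optimal $\alpha$ has a closed form, which makes the relation between the Chernoff constant and the Kullback-Leibler relative information easy to test. The sketch below assumes $0 < q, r < 1$ with $q \ne r$, uses the stationarity condition of $F_\alpha$ to obtain $\alpha^*$, and takes $x$ to be the first entry of the tilted distribution $p_{\alpha^*}$; the function names and the values $q = 0.7$, $r = 0.2$ are illustrative assumptions.

```python
import math

def overlap(q, r, alpha):
    """F_alpha for p0 = {q, 1-q} and p1 = {r, 1-r}."""
    return q**alpha * r**(1 - alpha) + (1 - q)**alpha * (1 - r)**(1 - alpha)

def optimal_alpha(q, r):
    """Stationary point of F_alpha, from d F_alpha / d alpha = 0."""
    log_odds_ratio = math.log(q * (1 - r) / (r * (1 - q)))
    ratio = ((1 - r) / r) * (math.log(1 - r) - math.log(1 - q)) / (math.log(q) - math.log(r))
    return math.log(ratio) / log_odds_ratio

def tilted_x(q, r):
    """First entry of the tilted distribution p_alpha* = {x, 1-x}."""
    return math.log((1 - r) / (1 - q)) / math.log(q * (1 - r) / (r * (1 - q)))

def binary_relative_information(x, q):
    """Kullback-Leibler relative information between {x, 1-x} and {q, 1-q}."""
    return x * math.log(x / q) + (1 - x) * math.log((1 - x) / (1 - q))

q, r = 0.7, 0.2
a_star = optimal_alpha(q, r)
x = tilted_x(q, r)
lam = overlap(q, r, a_star)                       # Chernoff constant for this pair
k_to_p0 = binary_relative_information(x, q)
k_to_p1 = binary_relative_information(x, r)
x_direct = q**a_star * r**(1 - a_star) / lam      # tilted distribution computed directly
```

The two relative informations agree with each other and with $-\ln\lambda$, which is the content of the theorem in this special case.
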
2.2.3 Statistical Overlap

Because the Chernoff bound is generally hard to get at analytically, it is worthwhile to reconsider
the other bounds arising in the derivation of Theorem 2.1. These are all quantities of the form

$$F_\alpha(p_0/p_1) = \sum_{b=1}^{n} p_0(b)^{\alpha}\, p_1(b)^{1-\alpha} , \tag{2.38}$$

for $0 < \alpha < 1$. We shall call the functions appearing in Eq. (2.38) the Rényi overlaps of order $\alpha$
because of their close connection to the relative information of order $\alpha$ introduced by Rényi [36, 37],

$$K_\alpha(p_0/p_1) = \frac{1}{\alpha - 1} \ln\!\left( \sum_{b=1}^{n} p_0(b)^{\alpha}\, p_1(b)^{1-\alpha} \right) . \tag{2.39}$$

Each $F_\alpha(p_0/p_1)$ forms a notion of distinguishability in its own right, albeit not as operationally
defined as the Chernoff bound. All these quantities are bounded between 0 and 1, reaching the
minimum of 0 if and only if the distributions do not overlap at all, and reaching 1 if and only if the
distributions are identical. For a fixed $\alpha$, the smaller $F_\alpha(p_0/p_1)$ is, the more distinguishable the
distributions are. Moreover each of these can be used for generating a valid notion of nearness
(more technically, a topology [38]) on the set of probability distributions [39, 40].

A particular Rényi overlap that is quite useful to study because of its many pleasant properties
is the one of order $\alpha = \tfrac{1}{2}$,

$$F(p_0, p_1) = \sum_{b=1}^{n} \sqrt{p_0(b)}\, \sqrt{p_1(b)} . \tag{2.40}$$



We shall dub this measure of distinguishability the statistical overlap or fidelity. It has had a long 
and varied history, being rediscovered in different contexts at different times by Bhattacharyya 
H, Jeffreys |3|, Rao @, Renyi ||], Csiszar ||], and Wootters |, |6|. Perhaps its most 
compelling foundational significance is that the quantity 



$$D(p_0/p_1) = \cos^{-1}\!\left( \sum_{b=1}^{n} \sqrt{p_0(b)}\, \sqrt{p_1(b)} \right) \tag{2.41}$$

corresponds to the geodesic distance between $p_0(b)$ and $p_1(b)$ on the probability simplex when its
geometry is specified by the Riemannian metric

$$ds^2 = \sum_{b=1}^{n} \frac{(\delta p(b))^2}{p(b)} . \tag{2.42}$$



This metric is known as the Fisher information metric p7| , [To| ] and is useful because it appears in 
expressions for the decrease (with the number of samplings) of an estimator's variance in maximum 
likelihood parameter estimation. 
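
The overlap of order one half and the associated angle are simple to compute. The sketch below is a plain-Python illustration; the function names are illustrative, and the Rényi overlap is restricted to $0 < \alpha < 1$ so that outcomes with zero probability in either distribution contribute nothing.

```python
import math

def renyi_overlap(p0, p1, alpha):
    """Sum over outcomes of p0(b)^alpha * p1(b)^(1-alpha), for 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return sum(a**alpha * b**(1.0 - alpha) for a, b in zip(p0, p1) if a > 0 and b > 0)

def statistical_overlap(p0, p1):
    """Sum over outcomes of sqrt(p0(b)) * sqrt(p1(b))."""
    return sum(math.sqrt(a * b) for a, b in zip(p0, p1))

def overlap_angle(p0, p1):
    """Inverse cosine of the statistical overlap (clamped against rounding above 1)."""
    return math.acos(min(1.0, statistical_overlap(p0, p1)))

identical = statistical_overlap([0.3, 0.7], [0.3, 0.7])
disjoint = statistical_overlap([1.0, 0.0], [0.0, 1.0])
partial = statistical_overlap([0.96, 0.04], [0.04, 0.96])
angle_disjoint = overlap_angle([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give overlap 1 and angle 0, while distributions with disjoint support give overlap 0 and the maximal angle $\pi/2$.
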

Unfortunately, $F(p_0/p_1)$ does not appear to be strictly tied to any statistical inference problem
or achievable resource bound as are most other measures of distinguishability studied in this
Chapter (in particular the probability of error, the Chernoff bound, the Kullback-Leibler relative
information, the mutual information, and the Fisher information).¹ However, it nevertheless remains
an extremely useful quantity mathematically as will be seen in the following chapters. Moreover it
remains of interest because of its intriguing resemblance to a quantum mechanical inner product
[45, 46]: it is equal to the sum of a product of "amplitudes" (square roots of probabilities) just
as the quantum mechanical inner product is.

¹Imre Csiszár, private communication.
2.3 Distinguishability via "Keeping the Expert Honest"

We have already agreed that probabilities must be interpreted as subjective states of knowledge.
How then can we ever verify whether a probability assignment is "true" or not? Simply put, we
cannot. We can never know whether a source of probability assignments, such as a weatherman or
economic forecaster, is telling the truth or not. The best we can hope for in a situation where we
must rely on someone else's probability assignment is that there is an effective strategy for inducing
him to tell the truth, an effective strategy to steer him to be as true to his state of knowledge as
possible when disseminating what he knows. This is the problem of "keeping the expert honest"
[49, 50, 53]. The resolution of this problem gives rise to another measure of distinguishability
for probability distributions, the Kullback-Leibler relative information [11].

Let us start this section by giving a precise statement of the honest-expert problem. Suppose
an expert's knowledge of some state of affairs is quantified by a probability distribution $p_0(b)$,
$b = 1, \ldots, n$, and he is willing to communicate that distribution for a price. If we agree to pay for
his services, then, barring the use of lie detector tests and truth serums, we can never know whether
we got our money's worth in the deal. There is no way to tell just by looking at the outcome of an
experiment whether the distribution $p_0(b)$ represents his true state of knowledge or whether some
other distribution $p_1(b)$ does. The only thing we can do to safeguard against dishonesty is to agree
exclusively to payment schemes that somehow build in an incentive for the expert to be honest.

Imagine the expert agrees to the following payment scheme. If the expert gives a probability
distribution $p_1(b)$, then after we perform an experiment to elicit the actual state of affairs, he will
be paid an amount that depends upon which outcome occurs and the distribution $p_1(b)$. This
particular type of payment is proposed because, though probabilities do not dictate the outcomes
of an experiment, the events themselves nevertheless do give us an objective handle on the problem.

Say, for instance, we pay an amount $F_b(p_1(b))$ if outcome $b$ actually occurs, where

$$F_b : [0, 1] \to \mathbb{R} , \qquad b = 1, \ldots, n \tag{2.43}$$

is some fixed set of functions independent of the probabilities under consideration. Depending
upon the form of the functions $F_b$, it may well be in the expert's best interest to lie in reporting his
probabilities. That is to say, if the expert's true state of knowledge is captured by the distribution
$p_0(b)$, his expected earnings for reporting the distribution $p_1(b)$ will be

$$\bar{F} = \sum_{b=1}^{n} p_0(b)\, F_b(p_1(b)) . \tag{2.44}$$

Unless his expected earnings turn out to be less upon lying than in telling the truth, i.e.,

$$\sum_{b=1}^{n} p_0(b)\, F_b(p_1(b)) \le \sum_{b=1}^{n} p_0(b)\, F_b(p_0(b)) , \tag{2.45}$$

there is no incentive for him to be honest (that is, if the expert acts rationally!). In this context,
the problem of "keeping the expert honest" is that of trying to find a set of functions $F_b$ for which
Eq. (2.45) remains true for all distributions $p_0(b)$ and $p_1(b)$, $b = 1, \ldots, n$.



If such a program can be carried out, then it will automatically give a measure of distinguishability
for probability distributions. Namely, the difference between the maximal expected payoff
and the expected payoff for a dishonesty,

$$ K(p_0/p_1) = \sum_{b=1}^{n} p_0(b) \left[ \tilde{F}_b\big(p_0(b)\big) - \tilde{F}_b\big(p_1(b)\big) \right] , \tag{2.46} $$

becomes an attractive notion of distinguishability. (The tildes over the $F_b$ in this formula signify
that they are functions optimal for keeping the expert honest.) This quantity captures the idea that
the more distinguishable two probability distributions are, the harder it should be for an expert to 
pass one off for the other. In this case, the bigger the expert's lie, the greater the expected loss he 
will have to take in giving it. 

Of course, one could consider using any payoff function whatsoever in Eq. (2.46) and calling the
result a measure of distinguishability, but that would be rather useless. Only a payment scheme
optimal for this problem is relevant for distilling a probability distribution's identity. Only this sort 
of payment scheme has a useful interpretation. 

Nevertheless, one would be justified in expecting even more from a measure of distinguishability.
The honest-expert problem does not, at first sight, appear to restrict the class
of optimal functions very tightly at all. One might, for instance, further want a measure of
distinguishability that attains its minimal value of zero if and only if $p_1(b) = p_0(b)$. Or one might
want it to have certain concavity properties. Interestingly enough, these sorts of things are already
assured, though not explicit, in the posing of the honest-expert problem.

2.3.1 Derivation of the Kullback-Leibler Information 



Let us use the work of Aczel [^] to demonstrate an exact expression for Eq. (2.46). When $n \ge 3$
it turns out that, up to various constants, there is a unique function satisfying Eq. (2.45) for all
distributions $p_0(b)$ and $p_1(b)$. We shall consider this case first by proving the following theorem.

Theorem 2.3 (Aczel) Let $n \ge 3$. Then the inequality

$$ \sum_{k=1}^{n} p_k F_k(q_k) \le \sum_{k=1}^{n} p_k F_k(p_k) \tag{2.47} $$

is satisfied for all $n$-point probability distributions $(p_1,\ldots,p_n)$ and $(q_1,\ldots,q_n)$ if and only if there
exist constants $\alpha$ and $\gamma_1,\ldots,\gamma_n$ such that

$$ F_k(p) = \alpha \ln p + \gamma_k , \tag{2.48} $$

for all $k = 1,2,\ldots,n$.



△ The main point of this theorem is that Eq. (2.45), though appearing quite imprecise, is
tremendously restrictive. We shall presently establish the "only if" part of the theorem. To do this
we assume the inequality to hold and focus on the functions $F_1$ and $F_2$. This can be carried out by
restricting the distributions $(p_1,\ldots,p_n)$ and $(q_1,\ldots,q_n)$ to be such that $q_i = p_i$ for all $i \ge 3$ while
$p_1 = p$, $q_1 = q$, $p_2$ and $q_2$ otherwise remain free. Then we can define $r = p + p_2 = q + q_2$ where also

$$ r = 1 - \sum_{i=3}^{n} p_i = 1 - \sum_{i=3}^{n} q_i . \tag{2.49} $$



Note that because $n \ge 3$, $r$ is a number strictly less than unity. With these definitions, Eq. (2.45)
reduces to

$$ p F_1(q) + (r-p) F_2(r-q) \le p F_1(p) + (r-p) F_2(r-p) , \tag{2.50} $$

since all the other terms for $k \ge 3$ cancel. This new inequality already contains within it enough
to show that $F_1$ and $F_2$ are monotonically nondecreasing and differentiable at every point in their
domain. Let us work on showing these properties straight away.

Eq. (2.50) can be rearranged to become

$$ p \left[ F_1(p) - F_1(q) \right] \ge (r-p) \left[ F_2(r-q) - F_2(r-p) \right] . \tag{2.51} $$

Upon interchanging the symbols $p$ and $q$, the same reasoning also gives rise to the inequality,

$$ q \left[ F_1(q) - F_1(p) \right] \ge (r-q) \left[ F_2(r-p) - F_2(r-q) \right] . \tag{2.52} $$

Multiplying Eq. (2.51) by $(r-q)$, Eq. (2.52) by $(r-p)$, and adding the resultants, we get

$$ (r-q)\,p \left[ F_1(p) - F_1(q) \right] + (r-p)\,q \left[ F_1(q) - F_1(p) \right] \ge 0 , \tag{2.53} $$

which implies

$$ r\,(p-q) \left[ F_1(p) - F_1(q) \right] \ge 0 . \tag{2.54} $$

Then if $p > q$, it must be the case that $F_1(p) \ge F_1(q)$ so that this inequality is maintained. It
follows that $F_1$ is a monotonically nondecreasing function.

Now we must show the same property for $F_2$. To do this, we instead multiply Eq. (2.51) by $q$,
Eq. (2.52) by $p$, and add the results of these operations. This gives

$$ 0 \ge q\,(r-p) \left[ F_2(r-q) - F_2(r-p) \right] + p\,(r-q) \left[ F_2(r-p) - F_2(r-q) \right] , \tag{2.55} $$

or, after rearranging,

$$ 0 \le r \left[ (r-p) - (r-q) \right] \left[ F_2(r-p) - F_2(r-q) \right] . \tag{2.56} $$

So that, if $(r-p) > (r-q)$, we must have in like manner $F_2(r-p) \ge F_2(r-q)$ to maintain the
inequality. Therefore, $F_2$ must also be a monotonically nondecreasing function.

Putting these two facts to the side for the moment, we presently seek a tighter relation between
the functions $F_1$ and $F_2$. To this end, we multiply Eq. (2.51) by $q$ and Eq. (2.52) by $-p$. This gives

$$ pq \left[ F_1(p) - F_1(q) \right] \ge q\,(r-p) \left[ F_2(r-q) - F_2(r-p) \right] \tag{2.57} $$

and

$$ pq \left[ F_1(p) - F_1(q) \right] \le p\,(r-q) \left[ F_2(r-q) - F_2(r-p) \right] . \tag{2.58} $$

These two inequalities together imply

$$ \frac{r-p}{p} \left[ F_2(r-q) - F_2(r-p) \right] \le F_1(p) - F_1(q) \tag{2.59} $$

and

$$ F_1(p) - F_1(q) \le \frac{r-q}{q} \left[ F_2(r-q) - F_2(r-p) \right] . \tag{2.60} $$

Dividing this through by $(p-q)$, taken positive (for $p < q$ both inequalities reverse and the
conclusion below is unchanged), we get finally

$$ \frac{r-p}{p} \left( \frac{F_2(r-q) - F_2(r-p)}{(r-q) - (r-p)} \right) \le \frac{F_1(p) - F_1(q)}{p-q} \tag{2.61} $$

and

$$ \frac{F_1(p) - F_1(q)}{p-q} \le \frac{r-q}{q} \left( \frac{F_2(r-q) - F_2(r-p)}{(r-q) - (r-p)} \right) . \tag{2.62} $$

From these expressions we know, by the Pinching Theorem of elementary calculus, that if the limits

$$ \lim_{q \to p} \frac{r-p}{p} \left( \frac{F_2(r-q) - F_2(r-p)}{(r-q) - (r-p)} \right) = \frac{r-p}{p}\, F_2'(r-p) \tag{2.63} $$

and

$$ \lim_{q \to p} \frac{r-q}{q} \left( \frac{F_2(r-q) - F_2(r-p)}{(r-q) - (r-p)} \right) = \frac{r-p}{p}\, F_2'(r-p) \tag{2.64} $$

exist, then so does

$$ \lim_{q \to p} \frac{F_1(p) - F_1(q)}{p-q} = F_1'(p) , \tag{2.65} $$

and the limits must be identical. In other words, if $F_2$ is differentiable at $(r-p)$, then $F_1$ is
differentiable at $p$ and

$$ p\,F_1'(p) = (r-p)\,F_2'(r-p) . \tag{2.66} $$

Recall, however, that $p_2$ is not uniquely fixed by $p$ since $n \ge 3$. Thus neither is $r$; it can range
anywhere between $p$ and 1. This allows us to write the statement preceding Eq. (2.66) in the
converse form: if $F_1$ is not differentiable at the point $p$, then $F_2$ is not differentiable at any point
$(r-p)$ in the open set $(0, 1-p)$. This is the sought after tight relation between $F_1$ and $F_2$.

This statement can be combined with the fact that $F_1$ and $F_2$ are monotonic for the final thrust
of the proof. For this we will rely on a theorem from elementary real analysis sometimes called
Lebesgue's theorem |5^, page 96]:

Lemma 2.1 Let $f$ be an increasing real-valued function on the interval $[a,b]$. Then $f$ is differentiable
almost everywhere. The derivative $f'$ is measurable, and $\int_a^b f'(x)\,dx \le f(b) - f(a)$.

Actually we are only concerned with the first conclusion of this. It can be seen in a qualitative
manner as follows. The points where $f$ is not differentiable can only correspond to kinks or jumps
in its graph that are at best never decreasing in height. Thus one can easily imagine that, in
travelling from points $a$ to $b$, a continuous infinity of such kinks and jumps (as would be required
for a set of nonzero measure) would cause the graph to blow up to infinity before the end of the interval
were ever reached. A more rigorous demonstration of this theorem will not be given here.

From this theorem we immediately have that $F_1$ must be differentiable everywhere on the closed
interval $[0,1]$. For suppose there were a point $p$ at which it were not differentiable. Then $F_2$ would
not be differentiable anywhere within the set $(0, 1-p)$, which has nonzero measure, contradicting the
fact that it is a monotonically nondecreasing function. Now, using the Pinching Theorem again, we have that
$F_2$ is differentiable everywhere.

Therefore, let $s = r - p$. Since $p$ and $s$ are independent and Eq. (2.66) must always hold, it
follows that there must be a constant $\alpha$ such that

$$ p\,F_1'(p) = s\,F_2'(s) = \alpha . \tag{2.67} $$

Because $F_1$ and $F_2$ are monotonically nondecreasing, $\alpha \ge 0$. So

$$ F_1'(p) = \frac{\alpha}{p} \qquad \text{and} \qquad F_2'(p) = \frac{\alpha}{p} ; \tag{2.68} $$

integrating this we get

$$ F_1(p) = \alpha \ln p + \gamma_1 \qquad \text{and} \qquad F_2(p) = \alpha \ln p + \gamma_2 \tag{2.69} $$

where $\gamma_1$ and $\gamma_2$ are integration constants.

Running through the same argument but assuming $q_i = p_i$ for all $i$ except $i = 1$ and $i = k$, it
follows in like manner that $F_k(p) = \alpha \ln p + \gamma_k$ for all $k = 1,\ldots,n$. This concludes the proof of the
"only if" side of the theorem.

To establish the "if" part of the theorem, we need only demonstrate the verity of the Shannon
inequality,

$$ \sum_{k=1}^{n} p_k \ln q_k \le \sum_{k=1}^{n} p_k \ln p_k , \tag{2.70} $$

with equality if and only if $p_k = q_k$ for all $k$. To see this note that the function $f(x) = \ln x$ is
concave (since $f''(x) = -1/x^2 < 0$) and consequently always lies below its tangent line at $x = 1$. In
other words, $f(x)$ always lies below the line

$$ y = f'(1)\,x + \left[ f(1) - f'(1) \right] = x - 1 . \tag{2.71} $$

Hence, $\ln x \le x - 1$ with equality holding if and only if $x = 1$. Therefore it immediately follows
that

$$ \ln\!\left( \frac{q_k}{p_k} \right) \le \frac{q_k}{p_k} - 1 , \tag{2.72} $$

with equality if and only if $p_k = q_k$. This, in turn, implies

$$ p_k \left( \ln q_k - \ln p_k \right) = p_k \ln\!\left( \frac{q_k}{p_k} \right) \le q_k - p_k , \tag{2.73} $$

so that, summing over $k$ and using $\sum_k q_k = \sum_k p_k = 1$,

$$ \sum_{k=1}^{n} p_k \left( \ln q_k - \ln p_k \right) \le 0 , \tag{2.74} $$

with equality holding in the last expression if and only if $p_k = q_k$ for all $k$. Rearranging Eq. (2.74)
concludes the proof. □
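As a numerical sanity check on the Shannon inequality, Eq. (2.70), the following sketch draws random pairs of distributions and evaluates the gap between the two sides. The helper names (`shannon_gap`, `random_distribution`) and the seed are illustrative assumptions; this is a spot check, not a proof.

```python
import math
import random

def shannon_gap(p, q):
    """sum_k p_k (ln p_k - ln q_k); Eq. (2.70) says this is never negative."""
    return sum(pk * (math.log(pk) - math.log(qk)) for pk, qk in zip(p, q) if pk > 0)

def random_distribution(n, rng):
    """A random n-point distribution with strictly positive entries."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
gaps = [
    shannon_gap(random_distribution(4, rng), random_distribution(4, rng))
    for _ in range(1000)
]
min_gap = min(gaps)

p = random_distribution(4, rng)
self_gap = shannon_gap(p, p)   # the equality case p = q

print(min_gap >= 0.0, self_gap)
```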

When $n = 2$, the above method of proof for the "only if" side of Theorem 2.3 fails. This
follows technically because, in that case, the quantity $r$ must be fixed to the value 1. Moreover,
it turns out that Eq. (2.45), independent of proof method, no longer specifies a relatively unique
solution. For instance, even if one were to make the restriction $F_1(p) = F_2(p) = f(p)$, a theorem
due to Muszely [^, 57] states that any function of the following form will satisfy the honest-expert
inequality for all distributions:

$$ f(p) = (1-p)\, U\!\left( p - \tfrac{1}{2} \right) + \int_{0}^{p - \frac{1}{2}} U(t)\, dt + C , \tag{2.75} $$

where $C$ is constant and $U(t)$ is any continuous, increasing, odd function defined on the open
interval $(-\tfrac{1}{2}, \tfrac{1}{2})$. Thus there are many, many ways of keeping the expert honest when talking about
probability distributions over two outcomes.
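A quick way to see that non-logarithmic payoffs can work for two outcomes is to try the simplest admissible choice, $U(t) = t$. The sketch below assumes the integral in Eq. (2.75) runs from 0 to $p - \tfrac{1}{2}$ (a different constant lower limit would only shift $C$); the function names and the grid are illustrative.

```python
def payoff(p, C=0.0):
    """f(p) = (1-p) U(p-1/2) + integral_0^{p-1/2} U(t) dt + C, with U(t) = t."""
    t = p - 0.5
    return (1.0 - p) * t + 0.5 * t * t + C

def expected_earnings(p, q):
    """Expected payoff when outcome 1 has true probability p and q is reported."""
    return p * payoff(q) + (1.0 - p) * payoff(1.0 - q)

grid = [i / 20 for i in range(21)]
honest_is_optimal = all(
    expected_earnings(p, q) <= expected_earnings(p, p) + 1e-12
    for p in grid
    for q in grid
)
print(honest_is_optimal)
```

For this choice of $U$ the expected loss from reporting $q$ instead of $p$ works out to $(p-q)^2/2$, a quadratic rather than logarithmic penalty.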

We, however, shall not let a glitch in the two-outcome case deter us in defining a new measure of
distinguishability. Namely, using the robust payoff function defined by Theorem 2.3 in conjunction
with Eq. (2.46), we obtain

$$ K(p_0/p_1) = \sum_{b=1}^{n} p_0(b) \left[ \ln p_0(b) - \ln p_1(b) \right] = \sum_{b=1}^{n} p_0(b) \ln\!\left( \frac{p_0(b)}{p_1(b)} \right) . \tag{2.76} $$

This measure of distinguishability has appeared in other contexts and is known by various names:
the Kullback-Leibler relative information, cross entropy, directed divergence, update information,
and information gain.
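A minimal sketch of Eq. (2.76) in code may help fix the conventions. The function name `relative_information` and the handling of zero probabilities (terms with $p_0(b) = 0$ are skipped; a zero in $p_1$ where $p_0$ is nonzero gives infinity) are choices made here for illustration.

```python
import math

def relative_information(p0, p1):
    """K(p0/p1) = sum_b p0(b) ln(p0(b)/p1(b)), in nats."""
    total = 0.0
    for a, b in zip(p0, p1):
        if a == 0:
            continue            # convention 0 ln 0 = 0
        if b == 0:
            return math.inf     # p1 rules out an outcome that p0 allows
        total += a * math.log(a / b)
    return total

p0 = [0.5, 0.5]
p1 = [0.25, 0.75]
forward = relative_information(p0, p1)
backward = relative_information(p1, p0)
print(forward, backward, relative_information(p0, p0))
```

The two directions generally differ, which is why the quantity is called a directed divergence.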



2.3.2 Properties of the Relative Information 

Though a probability assignment can never be confirmed as true or false, one's trust in a state of
knowledge or probability assignment for some phenomenon can be either strengthened or weakened 
through observation. The simplest context where this can be seen is when the phenomenon in 
question is repeatable. Consider, as an example, a coin believed to be unusually weighted so that 
the probability for heads in a toss is 25%. If, upon tossing the coin 10,000 times, one were to find 
roughly 50% of the actual tosses giving rise to heads, one might indeed be compelled to reevaluate 
his assessment. The reason for this is that probability assignments can be used to make relatively 
sharp predictions about the frequencies of outcomes in repeatable experiments; the standard Law 
of Large Numbers |5^ specifies that relative frequencies of outcomes in a set of repeated trials 
approach the pre-assigned probabilities with probability 1. 

Not surprisingly, the probability for an "incorrect" frequency in a set of repeated trials has 
something to do with our distinguishability measure based on keeping the expert honest. This is 
because all the payment schemes considered in its definition were explicitly tied to the observational 
context. It turns out that the Kullback-Leibler relative information controls the exponential rate to 
zero forced upon the probability of an "incorrect" frequency by the Law of Large Numbers ||5^, |l^ . 
This gives us another useful operational meaning for the Kullback-Leibler information and gives 
more trust that it is a quantity worthy of study. This subsection is devoted to fleshing out this fact 
in detail. 

Let us start by demonstrating that the most probable frequency distribution in many trials
is indeed essentially the probability distribution for the outcomes of a single trial. For this we
suppose that an experiment of $B$ outcomes, $b \in \mathcal{B} = \{1,\ldots,B\}$, will be repeated $n$ times. The
probability distribution for the outcomes of a single trial will be denoted $p_0(b)$. The $n$ outcomes of
the $n$ experiments can be denoted by a vector $\vec{b} = (b_1, b_2, \ldots, b_n) \in \mathcal{B}^n$ where $b_i \in \mathcal{B}$ for each $i$. The
probability of any event $E$ on the space $\mathcal{B}^n$, that is to say, any set $E$ of outcome strings, will be
denoted by $P(E)$; the probability of the special case in which $E$ is a single string $\vec{b}$ will be denoted
$P(\vec{b})$.

The empirical frequency distribution of outcomes in $\vec{b}$ will be written as $F_{\vec{b}}(b)$,

$$ F_{\vec{b}}(b) = \frac{1}{n} \left( \#\ \text{of occurrences of } b \text{ in } \vec{b} \right) , \tag{2.77} $$

and the set of all possible frequencies will be denoted by $\mathcal{F}$,

$$ \mathcal{F} = \left\{ \big( F_{\vec{b}}(1), \ldots, F_{\vec{b}}(B) \big) : \vec{b} \in \mathcal{B}^n \right\} . \tag{2.78} $$

Note that the cardinality of the set $\mathcal{F}$, which we shall write as $|\mathcal{F}|$, is bounded above by $(n+1)^B$.
This follows because there are $B$ components in the vector specifying any particular frequency $F_{\vec{b}}$,

$$ F_{\vec{b}} = \left( \frac{n_1}{n}, \frac{n_2}{n}, \ldots, \frac{n_B}{n} \right) , \tag{2.79} $$

and each of the numerators of these components can take on any value between 0 and $n$ (subject
only to the constraint that $\sum_i n_i = n$). Thus there are fewer than $(n+1)^B$ choices for the vectors
$F_{\vec{b}}$. The fact that $|\mathcal{F}|$ grows only polynomially in $n$ turns out to be quite important for these
considerations.
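The polynomial bound on $|\mathcal{F}|$ can be checked by brute force for small cases. The sketch below enumerates all count vectors $(n_1,\ldots,n_B)$ summing to $n$ and compares their number with the standard stars-and-bars count and with $(n+1)^B$; the function names are illustrative.

```python
from math import comb

def count_vectors(n, B):
    """Yield every (n_1, ..., n_B) of nonnegative integers with sum n."""
    if B == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, B - 1):
            yield (first,) + rest

def number_of_frequencies(n, B):
    """Closed-form count of such vectors (stars and bars)."""
    return comb(n + B - 1, B - 1)

n, B = 5, 3
enumerated = len(list(count_vectors(n, B)))
print(enumerated, number_of_frequencies(n, B), (n + 1) ** B)
```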

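Now to get started, we shall also need a notation for the equivalence class of outcome strings
with the same empirical frequency distribution $p_1(b)$; for this we adopt

$$ T(p_1) = \left\{ \vec{b} \in \mathcal{B}^n : F_{\vec{b}}(b) = p_1(b) \ \text{for all } b \in \mathcal{B} \right\} . \tag{2.80} $$

In this notation, we then have the following theorem.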

Theorem 2.4 Suppose the experimental outcomes are described by a probability distribution $p_0(b)$
that is in fact an element of $\mathcal{F}$. Let $P(T(p_1))$ denote the probability that an outcome sequence will
be in $T(p_1)$. Then

$$ P(T(p_0)) \ge P(T(p_1)) . \tag{2.81} $$

That is to say, the most likely frequency distribution in $n$ trials is actually the pre-assigned
probability distribution.

Note that this theorem is restricted to probability assignments that are numerically equal to 
frequencies in n trials. We make this restriction to simplify the techniques involved in proving it 
and because it is all we will really need for the later considerations. 
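Theorem 2.4 can be spot-checked by exhaustive enumeration for a small case. The sketch below computes the probability of every frequency class for $n = 10$ trials with three outcomes and reports the most likely one; the specific choice $p_0 = (0.5, 0.3, 0.2)$ and the helper names are illustrative assumptions.

```python
from math import factorial, prod

def count_vectors(n, B):
    """Yield every (n_1, ..., n_B) of nonnegative integers with sum n."""
    if B == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, B - 1):
            yield (first,) + rest

def type_class_probability(counts, p0):
    """P(T(p1)) for p1 = counts / n when single trials follow p0."""
    size = factorial(sum(counts))
    for c in counts:
        size //= factorial(c)      # builds the multinomial coefficient |T(p1)|
    return size * prod(p ** c for p, c in zip(p0, counts))

n = 10
p0_counts = (5, 3, 2)              # chosen so that p0 is itself a frequency
p0 = [c / n for c in p0_counts]
probabilities = {c: type_class_probability(c, p0) for c in count_vectors(n, 3)}
most_likely = max(probabilities, key=probabilities.get)
print(most_likely)
```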

△ To see how this theorem comes about, note that

$$ P(T(p_0)) = \sum_{\vec{b} \in T(p_0)} P(\vec{b}) = |T(p_0)| \prod_{b=1}^{B} p_0(b)^{\,n p_0(b)} \tag{2.82} $$

and similarly

$$ P(T(p_1)) = |T(p_1)| \prod_{b=1}^{B} p_0(b)^{\,n p_1(b)} . \tag{2.83} $$

Expressions (2.82) and (2.83) can be compared if one realizes that $|T(p_0)|$ is, essentially by
definition, identically equal to the number of ways of inserting $np_0(1)$ objects of type 1, $np_0(2)$
objects of type 2, etc., into a total of $n$ slots. That is to say, $|T(p_0)|$ is a multinomial coefficient |5^,

$$ |T(p_0)| = \binom{n}{np_0(1),\, np_0(2),\, \ldots,\, np_0(B)} = \frac{n!}{(np_0(1))!\,(np_0(2))! \cdots (np_0(B))!} . \tag{2.84} $$

With this, we have

$$ \frac{P(T(p_0))}{P(T(p_1))} = \frac{\left( \prod_{b=1}^{B} (np_1(b))! \right) \left( \prod_{b=1}^{B} p_0(b)^{\,n p_0(b)} \right)}{\left( \prod_{b=1}^{B} (np_0(b))! \right) \left( \prod_{b=1}^{B} p_0(b)^{\,n p_1(b)} \right)} = \prod_{b=1}^{B} \frac{(np_1(b))!}{(np_0(b))!}\; p_0(b)^{\,n [p_0(b) - p_1(b)]} . \tag{2.85} $$

The desired result comes about through the inequality

$$ \frac{m!}{n!} \ge n^{m-n} . \tag{2.86} $$

Let us quickly demonstrate this before moving on. First, suppose $m > n$. Then,

$$ \frac{m!}{n!} = m(m-1) \cdots (n+1) \ge \underbrace{n \cdot n \cdots n}_{(m-n)\ \text{times}} = n^{m-n} . \tag{2.87} $$

Now suppose $m < n$. Then,

$$ \frac{m!}{n!} = \left[ n(n-1) \cdots (m+1) \right]^{-1} \ge \Big[ \underbrace{n \cdot n \cdots n}_{(n-m)\ \text{times}} \Big]^{-1} = \left( n^{n-m} \right)^{-1} = n^{m-n} . \tag{2.88} $$
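The factorial inequality, Eq. (2.86), is easy to confirm exactly for small integers. The sketch below uses rational arithmetic so that no floating-point rounding enters; the function name and the tested range are illustrative.

```python
from fractions import Fraction
from math import factorial

def factorial_bound_holds(m, n):
    """Check m!/n! >= n**(m-n) exactly; n is taken to be at least 1."""
    return Fraction(factorial(m), factorial(n)) >= Fraction(n) ** (m - n)

all_hold = all(
    factorial_bound_holds(m, n) for m in range(0, 15) for n in range(1, 15)
)
print(all_hold)
```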



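With Eq. (2.86) in hand, Eq. (2.85) gives

$$ \begin{aligned}
\frac{P(T(p_0))}{P(T(p_1))} &\ge \prod_{b=1}^{B} \big( n p_0(b) \big)^{n p_1(b) - n p_0(b)}\; p_0(b)^{\,n [p_0(b) - p_1(b)]} \\
&= \prod_{b=1}^{B} n^{\,n [p_1(b) - p_0(b)]}\; p_0(b)^{\,n [p_1(b) - p_0(b)]}\; p_0(b)^{\,n [p_0(b) - p_1(b)]} \\
&= \prod_{b=1}^{B} n^{\,n [p_1(b) - p_0(b)]} = n^{\,n \sum_b [p_1(b) - p_0(b)]} = n^0 = 1 .
\end{aligned} \tag{2.89} $$

Therefore,

$$ P(T(p_0)) \ge P(T(p_1)) , \tag{2.90} $$

and this completes the proof that the most likely frequency distribution in $n$ trials is just the
pre-assigned probability distribution. □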

The next step toward our goal of demonstrating the Kullback-Leibler information in this new
context is to work out an expression for the probability of an outcome string $\vec{b}$ with an "incorrect"
frequency distribution.

Theorem 2.5 Suppose the experimental outcomes are described by a probability distribution $p_0(b)$.
The probability for a particular string of outcomes $\vec{b} \in T(p_1)$ with the "wrong" frequencies is

$$ P(\vec{b}) = e^{-n [H(p_1) + K(p_1/p_0)]} , \tag{2.91} $$

where

$$ H(p_1) = -\sum_{b=1}^{B} p_1(b) \ln p_1(b) \tag{2.92} $$

is the Shannon entropy of the distribution $p_1(b)$.
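△ This can be seen with a little algebra:

$$ \begin{aligned}
P(\vec{b}) &= \prod_{i=1}^{n} p_0(b_i) = \prod_{b=1}^{B} p_0(b)^{\,n p_1(b)} = \prod_{b=1}^{B} e^{\,n p_1(b) \ln p_0(b)} \\
&= \prod_{b=1}^{B} \exp\!\big\{ n \left[ p_1(b) \ln p_0(b) + p_1(b) \ln p_1(b) - p_1(b) \ln p_1(b) \right] \big\} \\
&= \prod_{b=1}^{B} \exp\!\left\{ n \left[ p_1(b) \ln p_1(b) + p_1(b) \ln \frac{p_0(b)}{p_1(b)} \right] \right\} \\
&= \exp\!\left\{ -n \sum_{b=1}^{B} \left[ -p_1(b) \ln p_1(b) + p_1(b) \ln \frac{p_1(b)}{p_0(b)} \right] \right\} \\
&= \exp\!\big\{ -n \left[ H(p_1) + K(p_1/p_0) \right] \big\} . \qquad \Box
\end{aligned} \tag{2.93} $$

A simple corollary to this is,

Corollary 2.1 If the experimental outcomes are described by $p_0(b)$, the probability for a particular
string $\vec{b} \in T(p_0)$ with the "correct" frequencies is

$$ P(\vec{b}) = e^{-n H(p_0)} . \tag{2.94} $$

This follows because $K(p_0/p_0) = 0$.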
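Theorem 2.5 can be checked directly on a concrete outcome string. The sketch below multiplies out the single-trial probabilities for one string and compares the result with the exponential expression of Eq. (2.91); the distribution, the string, and the helper names are illustrative, with outcomes labelled 0, 1, 2 for indexing convenience.

```python
import math
from collections import Counter

def entropy(p):
    """Shannon entropy in nats, with the convention 0 ln 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def relative_information(p, q):
    """K(p/q) = sum_b p(b) ln(p(b)/q(b)), assuming q(b) > 0 wherever p(b) > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p0 = [0.5, 0.3, 0.2]                       # single-trial probabilities
string = [0, 0, 1, 2, 2, 2, 1, 0, 2, 2]    # one particular outcome string
n = len(string)

counts = Counter(string)
p1 = [counts[b] / n for b in range(len(p0))]   # its empirical frequencies

direct = math.prod(p0[b] for b in string)
formula = math.exp(-n * (entropy(p1) + relative_information(p1, p0)))
print(direct, formula)
```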

Using this corollary, in conjunction with Theorem 2.4, we can derive a relatively good estimate of
the number of elements in $T(p_1)$. This estimate is necessary for connecting the relative information
to the probability of an "incorrect" frequency.



Lemma 2.2 For any $p_1(b) \in \mathcal{F}$,

$$ (n+1)^{-B}\, e^{n H(p_1)} \le |T(p_1)| \le e^{n H(p_1)} . \tag{2.95} $$

△ This lemma can be seen in the following way. Suppose the probability is actually $p_1(b)$.
Then, using Corollary 2.1, we have

$$ P(T(p_1)) = |T(p_1)|\, e^{-n H(p_1)} . \tag{2.96} $$

This probability, however, must be no greater than 1. Hence, $|T(p_1)| \le e^{n H(p_1)}$, proving the right-hand
side of the lemma. The left-hand side follows similarly by considering the probability of all possible
frequencies and using Theorem 2.4,

$$ \begin{aligned}
1 = \sum_{p \in \mathcal{F}} P(T(p)) &\le P(T(p_1)) \sum_{p \in \mathcal{F}} 1 = |\mathcal{F}|\, P(T(p_1)) \\
&\le (n+1)^B\, P(T(p_1)) \\
&= (n+1)^B\, |T(p_1)|\, e^{-n H(p_1)} .
\end{aligned} \tag{2.97} $$

Thus $(n+1)^{-B}\, e^{n H(p_1)} \le |T(p_1)|$ and this completes the proof of the lemma. □
Let us now move quickly to our sought after theorem and its proof.

Theorem 2.6 Suppose the experimental outcomes are described by the probability distribution
$p_0(b)$. The probability that the outcome string will have some "incorrect" frequency distribution
$p_1(b)$ is $P(T(p_1))$. This number is bounded in the following way,

$$ (n+1)^{-B}\, e^{-n K(p_1/p_0)} \le P(T(p_1)) \le e^{-n K(p_1/p_0)} . \tag{2.98} $$

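△ This is now easy to prove. We just need write

$$ \begin{aligned}
P(T(p_1)) &= \sum_{\vec{b} \in T(p_1)} P(\vec{b}) \\
&= \sum_{\vec{b} \in T(p_1)} e^{-n [H(p_1) + K(p_1/p_0)]} \\
&= |T(p_1)|\, e^{-n [H(p_1) + K(p_1/p_0)]} ,
\end{aligned} \tag{2.99} $$

to see that we can use Lemma 2.2 to give the desired result. □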
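Both the counting bound of Lemma 2.2 and the probability bound of Theorem 2.6 can be verified by enumeration for small $n$. The sketch below loops over every frequency class for a given single-trial distribution and tests the two pairs of inequalities, allowing a tiny relative slack for floating-point rounding; all names and the test case are illustrative assumptions.

```python
import math
from math import factorial

def count_vectors(n, B):
    """Yield every (n_1, ..., n_B) of nonnegative integers with sum n."""
    if B == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, B - 1):
            yield (first,) + rest

def type_class_size(counts):
    """Multinomial coefficient |T(p1)| for p1 = counts / n."""
    size = factorial(sum(counts))
    for c in counts:
        size //= factorial(c)
    return size

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def relative_information(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def bounds_hold(n, p0, slack=1 + 1e-9):
    """Test Eq. (2.95) and Eq. (2.98) for every frequency class."""
    B = len(p0)
    poly = (n + 1) ** B
    for counts in count_vectors(n, B):
        p1 = [c / n for c in counts]
        size = type_class_size(counts)
        cap = math.exp(n * entropy(p1))
        if not (cap / poly <= size * slack and size <= cap * slack):
            return False
        prob = size * math.prod(q ** c for q, c in zip(p0, counts))
        rate = math.exp(-n * relative_information(p1, p0))
        if not (rate / poly <= prob * slack and prob <= rate * slack):
            return False
    return True

print(bounds_hold(12, [0.5, 0.3, 0.2]))
```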



2.4 Distinguishability via Mutual Information 

Consider what happens when one samples a known probability distribution $p(b)$, $b = 1,\ldots,n$. The
probability quantifies the extent to which the outcome can be predicted, but it generally does not
pin down the precise outcome itself. Upon learning the outcome of a sampling, one, in a very 
intuitive sense, "gains information" that he does not possess beforehand. For instance, if all the 
outcomes of the sampling are equally probable, then one will generally gain a lot of information from 
the sampling; there is essentially nothing that can be predicted about the outcome beforehand. On 
the other hand, if the probability distribution is highly peaked about one particular outcome, then 
one will generally gain very little information from the sampling; there will be almost no point in 
carrying out the sampling — its outcome can be predicted at the outset. This simple idea provides 
the starting point for building our last notion of distinguishability. 

Let us sketch the idea briefly before attempting to make it precise. Suppose there is a reasonable
way of quantifying the average information gained when one samples a distribution $q(b)$; denote
that quantity, whatever it may be, by $H(q)$. Then, if two probability distributions $p_0(b)$ and $p_1(b)$
are distinct in the sense that one has more unpredictable outcomes than the other, the average
information gained upon sampling them will also be distinct, i.e., either $H(p_0) > H(p_1)$ or vice
versa. For, in sampling the distribution with the more unpredictable outcomes, one can expect 
to gain a larger amount of information. Thus, immediately, the notion of information gain in a 
sampling provides a means for distinguishing probability distributions. At this level, however, one 
is no better off than in simply comparing the probabilities themselves. To get somewhere with this 
idea, a more interesting scenario must be developed. 

The problem in comparing distributions through the information one gains upon sampling 
them is that the information gain has nothing to say about the distribution itself — that quantity 
is assumed already known. What would happen, however, if one were to randomly choose between 
sampling the two different distributions, say with probabilities $\pi_0$ and $\pi_1$? We can make a case for
two distinct possibilities. First, perhaps trivially, suppose that in spite of choosing the sampling
distribution randomly, one still knows which distribution is in front of him at any given moment.
Then an average information gain $H(p_0)$ will ensue when the actual distribution is $p_0(b)$ and $H(p_1)$
will ensue when the actual distribution is $p_1(b)$; that is to say, the expected information gain in
sampling a known distribution is just $\pi_0 H(p_0) + \pi_1 H(p_1)$. Notice that this quantity, being an
average, is no less than the lesser of the two information gains and no greater than the greater of the two
gains, i.e.,

$$ \min\{ H(p_0), H(p_1) \} \le \pi_0 H(p_0) + \pi_1 H(p_1) \le \max\{ H(p_0), H(p_1) \} . \tag{2.100} $$

Now consider the opposing case where the identity of the distribution to be sampled remains
unknown. In this case, the most one can say about which outcome will occur in the sampling is
that it is controlled by the probability distribution $p(b) = \pi_0 p_0(b) + \pi_1 p_1(b)$. In other words, the
sampling outcome will be at least as unpredictable, on average, as it was when the distribution was known;
some of the unpredictability will be due to the indeterminism $p_0(b)$ and $p_1(b)$ describe and some of
the unpredictability will be due to the fact that the individual distribution from which the sample
is drawn remains unknown. Hence it must be the case that $H(p) \ge \pi_0 H(p_0) + \pi_1 H(p_1)$.

The excess of $H(p)$ over $\pi_0 H(p_0) + \pi_1 H(p_1)$ is the average gain of information one can expect
about the distribution itself. This quantity, called the mutual information [^, 61],

$$ J(p_0, p_1; \pi_0, \pi_1) = H(\pi_0 p_0 + \pi_1 p_1) - \big( \pi_0 H(p_0) + \pi_1 H(p_1) \big) , \tag{2.101} $$

is the natural candidate for distinguishability that we seek in this section. If the two distributions
$p_0(b)$ and $p_1(b)$ are completely distinguishable, then all the information gained in a sampling should
be solely about the identity of the distribution; the quantity $J(p_0, p_1; \pi_0, \pi_1)$ should reduce to
$H(\pi)$, the information that can be gained by sampling the prior distribution $\pi = \{\pi_0, \pi_1\}$. If the
distributions $p_0(b)$ and $p_1(b)$ are completely indistinguishable, then $J(p_0, p_1; \pi_0, \pi_1)$ should reduce
to zero; this signifies that in sampling one learns nothing whatsoever about the distribution from
which the sample is drawn.
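The two limiting cases just described can be illustrated numerically. The sketch below evaluates Eq. (2.101) with the Shannon form of $H$ (in nats) for a pair of distributions with disjoint supports and for a pair of identical distributions; the function names and example numbers are illustrative assumptions.

```python
import math

def entropy(p):
    """Shannon information in nats, with the convention 0 ln 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mutual_information(p0, p1, pi0, pi1):
    """J(p0, p1; pi0, pi1) = H(pi0 p0 + pi1 p1) - (pi0 H(p0) + pi1 H(p1))."""
    mixture = [pi0 * a + pi1 * b for a, b in zip(p0, p1)]
    return entropy(mixture) - (pi0 * entropy(p0) + pi1 * entropy(p1))

pi0, pi1 = 0.4, 0.6

# Completely distinguishable: the supports do not overlap.
disjoint = mutual_information([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.3, 0.7], pi0, pi1)

# Completely indistinguishable: the two distributions coincide.
identical = mutual_information([0.2, 0.8], [0.2, 0.8], pi0, pi1)

print(disjoint, entropy([pi0, pi1]), identical)
```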

Notice that this distinguishability measure depends crucially on the observer's prior state of
knowledge, quantified by $\pi = \{\pi_0, \pi_1\}$, about whether $p_0(b)$ or $p_1(b)$ is actually the case. Thus it is
a measure of distinguishability relative to a given state of knowledge. There is, of course, nothing
wrong with this, just as there was nothing wrong with the error-probability distinguishability
measure; one just needs to recognize it as such.

These are the ideas behind taking mutual information as a measure of distinguishability. In
the remainder of this section, we work toward justifying a precise expression for Eq. (2.101) and
showing in a detailed way how it can be interpreted in an operational context.

2.4.1 Derivation of Shannon's Information Function 

The function $H(p)$ that quantifies the average information gained upon sampling a distribution
$p(b)$ will ultimately turn out to be the famous Shannon information function |6^, ^

$$ H(p) = -\sum_{b} p(b) \ln p(b) . \tag{2.102} $$

What we should like to do here is justify this expression from first principles. That is to say, we
shall build up a theory of "information gain" based solely on the probabilities in an experiment
and find that that theory gives rise to the expression (2.102).

To start with our most basic assumption, we reiterate the idea that the information gained in 
performing an experiment or observation is a function of how well the outcomes to that experiment 
or observation can be predicted in the first place. Other characteristics of an outcome that might 
convey "information" in the common sense of the word, such as shape, color, smell, feel, etc., 
will be considered irrelevant; indeed, we shall assume any such properties already part of the very 
definition of the outcome events. Formally this means that if a set of events $\{x_1, x_2, \ldots, x_n\}$ has
a probability distribution $p(x)$, not only is the expected information gain in a sampling, $H(p)$,
exclusively a function of the numbers $p(x_1), p(x_2), \ldots, p(x_n)$, but also it must be independent
of the labelling of that set. In other words, $H(p) = H(p(x_1), p(x_2), \ldots, p(x_n))$ is required to be
invariant under permutations of its arguments. This is called the requirement of "symmetry."

The most important technical property of H{p) is that, even though information gain is a sub- 
jective concept depending on the observer's prior state of knowledge, it should at least be objective 
enough that it not depend on the method by which knowledge of the experimental outcomes is 
acquired. We can make this idea firm with a simple example. Consider an experiment with three 
mutually exclusive outcomes $x$, $y$, and $z$. Note that the probability that $z$ does not occur is

$$ p(\neg z) = 1 - p(z) = p(x) + p(y) . \tag{2.103} $$

The probabilities for $x$ and $y$ given that $z$ does not occur are

$$ p(x|\neg z) = \frac{p(x)}{p(x) + p(y)} \qquad \text{and} \qquad p(y|\neg z) = \frac{p(y)}{p(x) + p(y)} . \tag{2.104} $$

There are at least two methods by which an observer can gather the result of this experiment.
The first method is by the obvious tack of simply finding which outcome of the three possible ones
actually occurred. In this case, the expected information gain is, by our convention,

$$ H\big( p(x),\, p(y),\, p(z) \big) . \tag{2.105} $$

The second method is more roundabout. One could, for instance, first check whether $z$ did or did
not occur, and then in the event that it did not occur, further check which of $x$ and $y$ did occur.
In the first phase of this method, the expected information gain is

$$ H\big( p(\neg z),\, p(z) \big) . \tag{2.106} $$

For those cases in which the second phase of the method must be carried out, a further gain of
information can be expected. Namely,

$$ H\big( p(x|\neg z),\, p(y|\neg z) \big) . \tag{2.107} $$

Note, though, that this last case is only expected to occur a fraction $p(\neg z)$ of the time. Thus, in
total, the expected information gain by this more roundabout method is

$$ H\big( p(\neg z),\, p(z) \big) + p(\neg z)\, H\big( p(x|\neg z),\, p(y|\neg z) \big) . \tag{2.108} $$

The assumption of "objectivity" is that the quantities in Eqs. (2.105) and (2.108) are identical.
That is to say, upon changing the notation slightly to $p_x = p(x)$, $p_y = p(y)$, $p_z = p(z)$,

$$ H(p_x, p_y, p_z) = H(p_x + p_y,\, p_z) + (p_x + p_y)\, H\!\left( \frac{p_x}{p_x + p_y},\, \frac{p_y}{p_x + p_y} \right) . \tag{2.109} $$

In the event that we are instead concerned with $n$ mutually exclusive events, the same assumption
of "objectivity" leads to the identification,

$$ H(p_1, \ldots, p_n) = H(p_1 + p_2,\, p_3, \ldots, p_n) + (p_1 + p_2)\, H\!\left( \frac{p_1}{p_1 + p_2},\, \frac{p_2}{p_1 + p_2} \right) . \tag{2.110} $$
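It may help to see that the Shannon form of $H$ does satisfy the grouping condition, Eq. (2.110). The sketch below evaluates both sides for one example distribution; the helper names and the numbers are illustrative.

```python
import math

def entropy(p):
    """Shannon information in nats, with the convention 0 ln 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def grouped_entropy(p):
    """Right-hand side of Eq. (2.110): lump p_1 and p_2, then resolve them."""
    s = p[0] + p[1]
    return entropy([s] + list(p[2:])) + s * entropy([p[0] / s, p[1] / s])

p = [0.1, 0.2, 0.3, 0.4]
direct = entropy(p)
roundabout = grouped_entropy(p)
print(direct, roundabout)
```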

It turns out that the requirements of symmetry and objectivity (as embodied in Eq. (2.110))
are enough to uniquely determine the form of $H(p)$ (up to a choice of units) provided we allow
ourselves one extra convenience [63], namely, that we allow the introduction of an arbitrary positive
parameter $\alpha \ne 1$ into Eq. (2.110) in the following way,

$$ H_\alpha(p_1, \ldots, p_n) = H_\alpha(p_1 + p_2,\, p_3, \ldots, p_n) + (p_1 + p_2)^\alpha\, H_\alpha\!\left( \frac{p_1}{p_1 + p_2},\, \frac{p_2}{p_1 + p_2} \right) , \tag{2.111} $$

and define $H(p)$ to be the limiting value of $H_\alpha(p)$ as $\alpha \to 1$. (The introduction of the subscript on
$H_\alpha(p)$ is made simply to remind us that the solutions to Eq. (2.111) depend upon the parameter
$\alpha$.) This idea is encapsulated in the following theorem.

Theorem 2.7 (Daroczy) Let

$$ \Gamma_n = \left\{ (p_1, \ldots, p_n) \;\Big|\; p_k \ge 0,\ k = 1, \ldots, n,\ \text{and}\ \sum_{k=1}^{n} p_k = 1 \right\} \tag{2.112} $$

be the set of all $n$-point probability distributions and let $\Gamma = \bigcup_n \Gamma_n$ be the set of all discrete probability
distributions. Suppose the function $H_\alpha : \Gamma \to \mathbb{R}$, $\alpha \ne 1$, is symmetric in all its arguments and
satisfies Eq. (2.111) for each $n \ge 2$. Then, under the convention that $0 \ln 0 = 0$, the limiting value
of $H_\alpha$ as $\alpha \to 1$ is uniquely specified up to a constant $C$ by

$$ H(p_1, \ldots, p_n) = -\frac{C}{\ln 2} \sum_{i=1}^{n} p_i \ln p_i . \tag{2.113} $$

The constant $C$ in this expression fixes the "units" of information. If $C = 1$, information is said
to be measured in bits; if $C = \ln 2$, information is said to be measured in nats. (A relatively
obscure measure of information is the case $C = \log_{10} 2$, where the units are called Hartleys [64].)
In this document, we will generally take $C = \ln 2$. On the occasion, however, that we do consider
information in units of bits we shall write log() for the base-2 logarithm, rather than the more
common log2().
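The role of the constant $C$ can be illustrated with a uniform distribution over eight outcomes. The sketch below assumes the normalization $H = -(C/\ln 2) \sum_i p_i \ln p_i$, which matches the statement that $C = 1$ gives bits and $C = \ln 2$ gives nats; the function name is illustrative.

```python
import math

def information(p, C):
    """H(p) = -(C / ln 2) * sum_i p_i ln p_i, with the convention 0 ln 0 = 0."""
    return -(C / math.log(2)) * sum(x * math.log(x) for x in p if x > 0)

uniform8 = [1 / 8] * 8
bits = information(uniform8, C=1.0)
nats = information(uniform8, C=math.log(2))
hartleys = information(uniform8, C=math.log10(2))
print(bits, nats, hartleys)
```

The three values are the base-2, natural, and base-10 logarithms of 8, respectively.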

△ The proof of Theorem 2.7, deserving wider recognition, is due to Daroczy [^] and proceeds
as follows. Define $s_i = p_1 + \cdots + p_i$ and let

$$ f(x) = H_\alpha(x, 1-x) \qquad \text{for} \quad 0 \le x \le 1 . \tag{2.114} $$

Then, by repeated application of condition (2.111), it follows immediately that

$$ H_\alpha(p_1, \ldots, p_n) = \sum_{i=2}^{n} s_i^\alpha\, f\!\left( \frac{p_i}{s_i} \right) . \tag{2.115} $$

Thus all we need do is focus on finding an explicit expression for the function $f$.

We have from the symmetry requirement that $H_\alpha(x, 1-x) = H_\alpha(1-x, x)$ and hence,

$$ f(x) = f(1-x) . \tag{2.116} $$

In particular, $f(0) = f(1)$. Furthermore, if $x$ and $y$ are two nonnegative numbers such that $x + y \le 1$,
we must also have

$$ H_\alpha(x, y, 1-x-y) = H_\alpha(y, x, 1-x-y) . \tag{2.117} $$

However, by Eq. (2.111)

$$ \begin{aligned}
H_\alpha(x, y, 1-x-y) &= H_\alpha(x, 1-x) + (1-x)^\alpha\, H_\alpha\!\left( \frac{y}{1-x},\, \frac{1-x-y}{1-x} \right) \\
&= H_\alpha(x, 1-x) + (1-x)^\alpha\, H_\alpha\!\left( \frac{y}{1-x},\, 1 - \frac{y}{1-x} \right) \\
&= f(x) + (1-x)^\alpha\, f\!\left( \frac{y}{1-x} \right) .
\end{aligned} \tag{2.118} $$

Thus it follows that $f$ must satisfy the functional equation

$$ f(x) + (1-x)^\alpha\, f\!\left( \frac{y}{1-x} \right) = f(y) + (1-y)^\alpha\, f\!\left( \frac{x}{1-y} \right) \tag{2.119} $$

for $x, y \in [0,1)$ with $x + y \le 1$. (In the case $\alpha = 1$, Eq. (2.119) is known commonly as the
fundamental equation of information [66].)

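We base the remainder of our conclusions on the study of Eq. (2.119). Note first that if $x = 0$,
it reduces to,

$$ f(0) + f(y) = f(y) + (1-y)^\alpha f(0) . \tag{2.120} $$

Since $y$ is still arbitrary, it follows from this that $f(0) = 0$; thus $f(1) = 0$, too. Now let $p = 1 - x$
for $x \ne 1$ and let $q = y/(1-x) = y/p$. With this, the information equation (2.119) becomes

$$ f(p) + p^\alpha f(q) = f(pq) + (1-pq)^\alpha\, f\!\left( \frac{1-p}{1-pq} \right) . \tag{2.121} $$

We can use this equation to show that

$$ F(p,q) \equiv f(p) + \left[ p^\alpha + (1-p)^\alpha \right] f(q) \tag{2.122} $$

is symmetric in $q$ and $p$, i.e., $F(p,q) = F(q,p)$. From that fact, a unique expression for $f(p)$ follows
trivially. Let us just show this before going further:

$$ \begin{aligned}
0 &= F\!\left( p, \tfrac{1}{2} \right) - F\!\left( \tfrac{1}{2}, p \right) \\
&= f(p) + \left[ p^\alpha + (1-p)^\alpha \right] f\!\left( \tfrac{1}{2} \right) - f\!\left( \tfrac{1}{2} \right) - 2^{1-\alpha} f(p) \\
&= \left( 1 - 2^{1-\alpha} \right) f(p) + f\!\left( \tfrac{1}{2} \right) \left[ p^\alpha + (1-p)^\alpha - 1 \right] ,
\end{aligned} \tag{2.123} $$

which implies that

$$ f(p) = C \left( 2^{1-\alpha} - 1 \right)^{-1} \left[ p^\alpha + (1-p)^\alpha - 1 \right] , \tag{2.124} $$

where the constant $C = f(\tfrac{1}{2})$. Because $f(0) = f(1) = 0$, Eq. (2.124) also holds for $p = 0$ and $p = 1$.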

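To cap off the derivation of Eq. (2.124), let us demonstrate that $F(p,q)$ is symmetric. Just
expanding and regrouping, we have, by Eq. (2.121), that

$$ \begin{aligned}
F(p,q) &= \left[ f(p) + p^\alpha f(q) \right] + (1-p)^\alpha f(q) \\
&= f(pq) + (1-pq)^\alpha\, f\!\left( \frac{1-p}{1-pq} \right) + (1-p)^\alpha f(q) \\
&= f(pq) + (1-pq)^\alpha \left[ f\!\left( \frac{1-p}{1-pq} \right) + \left( \frac{1-p}{1-pq} \right)^{\!\alpha} f(q) \right] .
\end{aligned} \tag{2.125} $$

If we can show that the last term in this expression is symmetric in $q$ and $p$, then we will have
shown that $F(p,q)$ is symmetric. To this end, let us define

$$ A(p,q) = f\!\left( \frac{1-p}{1-pq} \right) + \left( \frac{1-p}{1-pq} \right)^{\!\alpha} f(q) . \tag{2.126} $$

Also, to save a little room, let

$$ z = \frac{1-p}{1-pq} . \tag{2.127} $$

Then,

$$ 1 - zq = \frac{1-q}{1-pq} \qquad \text{and} \qquad \frac{1-z}{1-zq} = p . \tag{2.128} $$

So that, upon using Eq. (2.121) again, we get

$$ \begin{aligned}
A(p,q) &= f(z) + z^\alpha f(q) \\
&= f(zq) + (1-zq)^\alpha\, f\!\left( \frac{1-z}{1-zq} \right) \\
&= f(1-zq) + (1-zq)^\alpha\, f\!\left( \frac{1-z}{1-zq} \right) \\
&= f\!\left( \frac{1-q}{1-pq} \right) + \left( \frac{1-q}{1-pq} \right)^{\!\alpha} f(p) \\
&= A(q,p) .
\end{aligned} \tag{2.129} $$

Thus $F(p,q)$ is symmetric. This completes the demonstration of Eq. (2.124).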

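We just need plug the expression for $f(p)$ into Eq. (2.115) to get a nearly final result,

$$ \begin{aligned}
H_\alpha(p_1,\ldots,p_n) &= C \left( 2^{1-\alpha} - 1 \right)^{-1} \sum_{i=2}^{n} s_i^{\alpha} \left[ \left(\frac{p_i}{s_i}\right)^{\alpha} + \left(1 - \frac{p_i}{s_i}\right)^{\alpha} - 1 \right] \\
&= C \left( 2^{1-\alpha} - 1 \right)^{-1} \sum_{i=2}^{n} \left( p_i^{\alpha} + s_{i-1}^{\alpha} - s_i^{\alpha} \right) \\
&= C \left( 2^{1-\alpha} - 1 \right)^{-1} \left( \sum_{i=1}^{n} p_i^{\alpha} - 1 \right) .
\end{aligned} \tag{2.130} $$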

Now in taking the limit $\alpha \to 1$, note that both the numerator and denominator of this expression
vanish. Thus we must use l'Hôpital's rule in calculating the limit, i.e., first take the derivative
with respect to $\alpha$ of the numerator and denominator separately and then take the limit:

$$ \lim_{\alpha \to 1} H_\alpha(p_1,\ldots,p_n) = \lim_{\alpha \to 1} C \left( -2^{1-\alpha} \ln 2 \right)^{-1} \left( \sum_{i=1}^{n} p_i^{\alpha} \ln p_i \right) = -\frac{C}{\ln 2} \sum_{i=1}^{n} p_i \ln p_i . \tag{2.131} $$

This completes our derivation of the Shannon information formula (2.102). It is to be hoped that
this has conveyed something of the austere origin of the information-gain concept. □
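
As a numerical sanity check on this derivation, the following minimal Python sketch evaluates the type-$\alpha$ information of Eq. (2.130), the function $f$ of Eq. (2.124), and the gap between the two sides of the functional equation (2.119). The function names and the normalization $C = 1$ are illustrative assumptions, not part of the derivation itself.

```python
import math


def daroczy_information(probs, alpha, C=1.0):
    """Eq. (2.130): C (2^(1-alpha) - 1)^(-1) (sum_i p_i^alpha - 1), for alpha != 1.

    The normalization C = f(1/2) is left free by the derivation; C = 1 is an
    assumption made here so that the alpha -> 1 limit is measured in bits.
    """
    power_sum = sum(p ** alpha for p in probs if p > 0)
    return C * (power_sum - 1.0) / (2.0 ** (1.0 - alpha) - 1.0)


def shannon_information_bits(probs):
    """-sum_i p_i log2 p_i, the alpha -> 1 limit of Eq. (2.130) when C = 1."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def f(x, alpha, C=1.0):
    """f(x) = H_alpha(x, 1 - x), which reproduces Eq. (2.124)."""
    return daroczy_information([x, 1.0 - x], alpha, C)


def functional_equation_gap(x, y, alpha):
    """Left side minus right side of Eq. (2.119); it should vanish."""
    lhs = f(x, alpha) + (1 - x) ** alpha * f(y / (1 - x), alpha)
    rhs = f(y, alpha) + (1 - y) ** alpha * f(x / (1 - y), alpha)
    return lhs - rhs
```

Evaluating `daroczy_information` at a value of `alpha` very close to 1 should approach `shannon_information_bits`, in line with the l'Hôpital argument above.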

We finally mention that the Daróczy informations of type-$\alpha$, i.e., Eq. (2.130), appearing in this
derivation are of interest in their own right. First of all, there is a simple relation between these
and the Rényi informations of degree-$\alpha$ introduced in Section 2.2; namely,



a 



Pi 



i=l 
1 



(2.132) 



Secondly, they share many properties with the Shannon information [^] while being slightly more 
tractable for some applications, there being no logarithm in their expression. 



2.4.2 An Interpretation of the Shannon Information 

The justification of the information-gain concept can be strengthened through an operational
approach to the question. To carry this out, let us develop the following example. Suppose we were
to perform an experiment with four possible outcomes $x_1$, $x_2$, $x_3$, $x_4$, the respective probabilities
being $p(x_1) = 1/20$, $p(x_2) = 1/5$, $p(x_3) = 1/4$, and $p(x_4) = 1/2$. The expected gain of information in this
experiment is given by Eq. (2.102) and is numerically approximately 1.68 bits. By the fundamental
postulate of Section 2.4.1, we know that this information gain will be independent of the method
of questioning used in discerning the outcome. In particular, we could consider all possible ways of
determining the outcome by way of binary yes/no questions. For instance, we could start by asking,
"Is the outcome $x_1$?" If the answer is yes, then we are done. If the answer is no, then we could
further ask, "Is the outcome $x_2$?," and proceed in similar fashion until the identity of the outcome
is at hand. This and three other such binary-question methodologies are depicted schematically in
Figure 2.1.

The point of interest to us here is that each such questioning scheme generates, by its very
nature, a code for the possible outcomes to the experiment. That code can be generated by writing
down, in order, the yes's and no's encountered in traveling from the root to each leaf of these
schematic trees. For instance, by substituting 0 and 1 for yes and no, respectively, the four trees
depicted in Figure 2.1 give rise to the codings:

         Scheme 1    Scheme 2    Scheme 3    Scheme 4
x_1 ↔    0           00          11          011
x_2 ↔    10          01          0           010
x_3 ↔    110         10          100         00
x_4 ↔    111         11          101         1



Codes that can be generated from trees in this way are called instantaneous or prefix-free and
are noteworthy for the property that concatenated strings of their codewords can be uniquely
deciphered just by reading from left to right. This follows because no codeword in such a coding
can be the "prefix" of any other codeword. As a case in point, using the coding generated by Scheme
1, the "message" 1111100010100110 uniquely corresponds to the concatenation $x_4 x_3 x_1 x_1 x_2 x_2 x_1 x_3$;
there are no other possibilities.

Figure 2.1: Binary Question Schemes

From these four examples, one can see that not all such questioning schemes are equally efficient.
For Scheme 1 the expected codeword length, i.e.,

$$ \bar{l} = \sum_i p(x_i)\, l(x_i) , \tag{2.133} $$

where $l(x_i)$ is the number of digits in the code word for $x_i$, is 2.70 binary digits. Those for Schemes
2-4 are 2.00, 2.55, and 1.75 binary digits, respectively. To see where this example is going, note
that each expectation value is greater than $H(p)$, the average information gained in sampling
the distribution $p(x_i)$. This inequality is no accident. As we shall see, the Shannon noiseless
coding theorem specifies that the expected codeword length of any instantaneous code must be
at least $H(p)$. Moreover, the minimal average codeword length is bounded above by $H(p) + 1$.
Reverting back to the language of questioning schemes, we have that the minimum average number
of binary questions required to discern the outcome of sampling the distribution $p(x_i)$ is
approximately equal to $H(p)$.

This we take as a new starting point for interpreting the information function: it is approxi- 
mately the minimal effort (quantified in terms of expected number of binary questions) required to 
discern the outcome of an experiment. To make this precise, we presently set out to demonstrate 
how the noiseless coding theorem comes about within this context. 
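
The numbers quoted in this example can be checked with a short Python sketch. It assumes the outcome probabilities 1/20, 1/5, 1/4, 1/2 (the values consistent with the quoted 1.68 bits and the four expected codeword lengths) and the four codings listed above; the names `PROBS`, `SCHEMES`, and `decode` are hypothetical, chosen only for illustration.

```python
import math

# Assumed outcome probabilities for (x1, x2, x3, x4).
PROBS = {"x1": 1 / 20, "x2": 1 / 5, "x3": 1 / 4, "x4": 1 / 2}

# Codewords read off the four question trees (0 = yes, 1 = no).
SCHEMES = {
    1: {"x1": "0", "x2": "10", "x3": "110", "x4": "111"},
    2: {"x1": "00", "x2": "01", "x3": "10", "x4": "11"},
    3: {"x1": "11", "x2": "0", "x3": "100", "x4": "101"},
    4: {"x1": "011", "x2": "010", "x3": "00", "x4": "1"},
}


def shannon_bits(probs):
    """Shannon information of the distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def expected_length(code, probs):
    """Expected codeword length, Eq. (2.133)."""
    return sum(probs[x] * len(word) for x, word in code.items())


def decode(message, code):
    """Decode by reading left to right; this works because the code is prefix-free."""
    inverse = {word: x for x, word in code.items()}
    symbols, current = [], ""
    for digit in message:
        current += digit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("message ends in the middle of a codeword")
    return symbols
```

For instance, `decode("1111100010100110", SCHEMES[1])` walks through the message digit by digit and emits a symbol each time a complete codeword has been read.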

Our first step in doing this is to demonstrate an elementary lemma of information theory, known
as the Kraft inequality, giving a useful analytic characterization of all possible instantaneous
codes.

Lemma 2.3 (Kraft) The codeword lengths $l_1 \le l_2 \le \cdots \le l_n$ of any binary instantaneous code
for a set of messages $\{x_1, x_2, \ldots, x_n\}$ must satisfy the inequality

$$ \sum_{k=1}^{n} 2^{-l_k} \le 1 . \tag{2.134} $$

Moreover, for any set of integers $k_1 \le k_2 \le \cdots \le k_n$ satisfying this inequality, there exists an
instantaneous code with these as codeword lengths.

△ The derivation of this lemma is really quite simple. Start with the coding tree generating
the instantaneous code and imbed it in a full tree of $2^{l_n}$ leaves. A full tree is a tree for which each
direct path leading from the root to a terminal leaf is of the same length; see Figure 2.4.2. With
this, one sees easily that the number of terminal leaves in the full tree stemming from the node
associated with the codeword of length $l_i$ is just $2^{l_n - l_i}$, $i = 1, \ldots, n$. Because no codeword is a
prefix of another, these sets of leaves are disjoint, and therefore the total number of terminal
leaves in this tree stemming from codeword nodes must be

$$ \sum_{i=1}^{n} 2^{l_n - l_i} . \tag{2.135} $$

This number, however, can be no larger than the number of leaves in the full tree. Thus

$$ \sum_{i=1}^{n} 2^{l_n - l_i} \le 2^{l_n} . \tag{2.136} $$

This proves the first statement of the lemma.

For the second statement of the lemma, start with a full tree of $2^{k_n}$ leaves. First place the symbol
$x_1$ at some node of depth $k_1$ and delete all further descendants of that node. Then place the symbol
$x_2$ at any remaining node of depth $k_2$ and remove its descendants. Iterating this procedure will
produce the appropriate coding tree and thus an instantaneous code with the specified codeword
lengths. □
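
The constructive half of the lemma can be sketched in a few lines of Python. The version below is one possible realization of the tree-pruning argument, not the only one: it always takes the leftmost free node of the required depth, which amounts to writing a running counter in binary. The function names are illustrative assumptions.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality (2.134), computed exactly."""
    return sum(Fraction(1, 2 ** l) for l in lengths)


def code_from_lengths(lengths):
    """Build a binary prefix-free code with the given codeword lengths.

    Lengths are processed in increasing order; each codeword is the binary
    expansion of a running counter, i.e. the leftmost node of the required
    depth that does not descend from an earlier codeword.
    """
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    codewords, counter, previous = [], 0, 0
    for l in sorted(lengths):
        counter <<= (l - previous)  # descend to depth l in the full tree
        codewords.append(format(counter, "b").zfill(l))
        counter += 1                # move to the next free node at this depth
        previous = l
    return codewords


def is_prefix_free(codewords):
    """True if no codeword is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)
```

Applied to the lengths 1, 2, 3, 3, this construction returns the Scheme 1 codewords of the example above.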

With the Kraft inequality in hand, it is a simple matter to derive the Shannon noiseless coding 
theorem. 

Theorem 2.8 (Shannon) Suppose messages $x_1, \ldots, x_n$ occur with probabilities $p(x_1), \ldots, p(x_n)$.
Then the minimal average codeword length $\bar{l}_{\min}$ for a binary instantaneous coding of these messages
satisfies

$$ H(p) \le \bar{l}_{\min} < H(p) + 1 , \tag{2.137} $$

where $H(p)$ is the Shannon information of the distribution $p(x_i)$ measured in bits.

△ To show the left-hand inequality, we just need note that for any instantaneous code the Kraft
inequality specifies

$$ \begin{aligned}
\bar{l} - H(p) &= \sum_i p(x_i)\, l(x_i) + \sum_i p(x_i) \log p(x_i) \\
&= -\sum_i p(x_i) \log 2^{-l(x_i)} + \sum_i p(x_i) \log p(x_i) \\
&= \sum_i p(x_i) \log\!\left(\frac{p(x_i)}{q(x_i)}\right) - \log\!\left(\sum_i 2^{-l(x_i)}\right) \\
&\ge \sum_i p(x_i) \log\!\left(\frac{p(x_i)}{q(x_i)}\right) ,
\end{aligned} \tag{2.138} $$

where

$$ q(x_i) = \frac{2^{-l(x_i)}}{\sum_j 2^{-l(x_j)}} \tag{2.139} $$

is a probability distribution constructed for the purpose at hand. Namely, the final quantity on
the right-hand side of Eq. (2.138) is then nonnegative by the Shannon inequality, Eq. (2.70), already
demonstrated. Thus it follows that

$$ \bar{l} - H(p) \ge 0 , \tag{2.140} $$

and so the minimal average instantaneous codeword length must be at least as large as $H(p)$.

Now to show the right-hand side of Theorem 2.8, we need only note that there is an instantaneous
code with codeword lengths given by

$$ l(x_i) = \lceil -\log p(x_i) \rceil , \tag{2.141} $$

where $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$, because these integers satisfy
the Kraft inequality. Therefore it must hold that

$$ \begin{aligned}
\bar{l}_{\min} &\le \sum_i p(x_i) \lceil -\log p(x_i) \rceil \\
&< \sum_i p(x_i) \left( -\log p(x_i) + 1 \right) \\
&= H(p) + 1 .
\end{aligned} \tag{2.142} $$

This concludes our proof of the Shannon noiseless coding theorem. □

This is one precise sense in which $\bar{l}_{\min} \approx H(p)$. Actually, the exact value of $\bar{l}_{\min}$ can be calculated
given the message probabilities $p(x_1), \ldots, p(x_n)$. This comes about by an optimal coding procedure
known as Huffman coding [68], [69], [70], [10]. Therefore, one might have wished to choose $\bar{l}_{\min}$ as the
appropriate measure of information under the present interpretation. This, however, encounters
an immediate objection: the Huffman coding algorithm gives no explicit analytical expression for
$\bar{l}_{\min}$. Thus using $\bar{l}_{\min}$ as a measure of information would be operationally difficult at best. Also,
though, there are even tighter upper bounds on $\bar{l}_{\min}$ than given by the noiseless coding theorem.
For instance if $p(x_1) \ge p(x_2) \ge \cdots \ge p(x_n)$, then [71],

$$ \bar{l}_{\min} - H(p) \le
\begin{cases}
p(x_1) + \sigma & \text{if } p(x_1) < \tfrac{1}{2} \\
2 - h(p(x_1)) - p(x_1) \le p(x_1) & \text{if } p(x_1) \ge \tfrac{1}{2}
\end{cases} \tag{2.143} $$

where $\sigma = 1 - \log e + \log\log e \approx 0.086$ and

$$ h(x) = -x \log x - (1-x) \log(1-x) . \tag{2.144} $$

Another bound is

$$ \bar{l}_{\min} - H(p) \le 1 - h(p(x_n)) \le 1 - 2p(x_n) . \tag{2.145} $$

Tighter bounds than this, in terms of $p(x_1)$ and $p(x_n)$, are known [72], but are not so easily
expressible. The upshot is that these generally force $\bar{l}_{\min}$ closer to $H(p)$ than the noiseless coding
theorem and thus strengthen the notion of "approximate" here.
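
To make these statements concrete, the sketch below computes the codeword lengths of Eq. (2.141) and the minimal expected length obtained from Huffman's merging procedure, so that the bounds (2.137), (2.143), and (2.145) can be checked for a given distribution. It is a minimal illustration under the assumption of base-2 logarithms; the function names are hypothetical.

```python
import heapq
import math


def shannon_bits(probs):
    """Shannon information in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def shannon_code_lengths(probs):
    """Codeword lengths of Eq. (2.141): the ceiling of -log2 p."""
    return [math.ceil(-math.log2(p)) for p in probs]


def huffman_expected_length(probs):
    """Minimal expected codeword length via Huffman's merging procedure.

    Each merge of the two least probable nodes adds one digit to every
    codeword beneath it, so the expected length is the sum of merged weights.
    """
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def binary_entropy(x):
    """h(x) of Eq. (2.144), in bits."""
    return shannon_bits([x, 1.0 - x])
```

For the four-outcome example with probabilities 1/2, 1/4, 1/5, 1/20, the Huffman length coincides with the 1.75 binary digits of Scheme 4, while the lengths of Eq. (2.141) give a slightly longer code that still lies within one digit of $H(p)$.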



Finally, in this context, we note that a direct consequence of Theorem 2.8 is the following. If
we were to repeat the experiment described by $p(x_k)$, say, $N$ times before asking a set of yes-no
questions to discern all $N$ outcomes, the minimum expected number of questions will be some $L_{\min}$
that satisfies

$$ H(P) \le L_{\min} < H(P) + 1 , \tag{2.146} $$

where $P(x_{i_1}, x_{i_2}, \ldots, x_{i_N}) = p(x_{i_1})\, p(x_{i_2}) \cdots p(x_{i_N})$ is the (product) probability distribution
describing all $N$ outcomes. Using the fact that

$$ \begin{aligned}
H(P) &= -\sum_{x_{i_1},\ldots,x_{i_N}} P(x_{i_1},\ldots,x_{i_N}) \log P(x_{i_1},\ldots,x_{i_N}) \\
&= -\sum_{x_{i_1},\ldots,x_{i_N}} P(x_{i_1},\ldots,x_{i_N}) \left( \sum_{k=1}^{N} \log p(x_{i_k}) \right) \\
&= -N \sum_{k=1}^{n} p(x_k) \log p(x_k) \\
&= N H(p) ,
\end{aligned} \tag{2.147} $$

Eq. (2.146) reduces to

$$ H(p) \le \frac{1}{N} L_{\min} < H(p) + \frac{1}{N} . \tag{2.148} $$

Therefore, if one is willing to collect data on multiple experiments before asking the yes-no
questions required to discern the outcomes, then one can make the expected number of questions per
experiment as close to $H(p)$ as one wishes, just by choosing $N$ sufficiently large. This is another,
strong sense in which $H(p)$ quantifies the minimal effort required to discern the outcome of an
experiment.
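
The block-coding statement of Eq. (2.148) can be illustrated numerically. The sketch below builds the product distribution for $N$ repetitions and uses Huffman's merging procedure to obtain $L_{\min}$; it assumes base-2 logarithms, and the helper names are hypothetical. The number of blocks grows as $n^N$, so this is only practical for small $N$.

```python
import heapq
import itertools
import math


def shannon_bits(probs):
    """Shannon information in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_expected_length(probs):
    """Minimal expected codeword length: the sum of all merged weights."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def questions_per_experiment(probs, N):
    """L_min / N when N independent repetitions are encoded as one block."""
    block_probs = [math.prod(block) for block in itertools.product(probs, repeat=N)]
    return huffman_expected_length(block_probs) / N
```

Calling `questions_per_experiment([0.9, 0.1], N)` for increasing `N` shows the expected number of questions per experiment being squeezed between $H(p)$ and $H(p) + 1/N$.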
2.4.3 The Mutual Information

The notion of information gain has now been analyzed from two very different perspectives, one
axiomatic and one operational. With a firm expression for this notion finally at hand, we may return
to the real object of this section, the measure of distinguishability known as mutual information.
Using Eq. (2.102) in conjunction with Eq. (2.101), we obtain various formulations of this quantity:

$$ \begin{aligned}
J(p_0, p_1; \pi_0, \pi_1) &= -\sum_b p(b) \ln p(b) + \pi_0 \sum_b p_0(b) \ln p_0(b) + \pi_1 \sum_b p_1(b) \ln p_1(b) \\
&= \pi_0 \sum_b p_0(b) \ln\!\left(\frac{p_0(b)}{p(b)}\right) + \pi_1 \sum_b p_1(b) \ln\!\left(\frac{p_1(b)}{p(b)}\right) \\
&= \pi_0 K(p_0/p) + \pi_1 K(p_1/p) ,
\end{aligned} \tag{2.149} $$

where $p(b) = \pi_0 p_0(b) + \pi_1 p_1(b)$ and $K(\cdot/\cdot)$ denotes the Kullback-Leibler relative information of
Section 2.3. The last form gives a secondary interpretation to the mutual information: in an
honest expert problem, it is the expert's expected loss for trying to pass off the mean distribution
$p(b)$ in place of either actual distribution $p_0(b)$ or $p_1(b)$.
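
The equality of the first and last forms of Eq. (2.149) is easy to confirm numerically. The following sketch computes the mutual information both ways, in nats; the function names are illustrative assumptions.

```python
import math


def mixture(p0, p1, pi0, pi1):
    """Mean distribution p(b) = pi0 p0(b) + pi1 p1(b)."""
    return [pi0 * a + pi1 * b for a, b in zip(p0, p1)]


def entropy_nats(p):
    """Shannon information with natural logarithms."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kullback_leibler(p, q):
    """K(p/q) = sum_b p(b) ln(p(b)/q(b)), in nats."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def mutual_information_entropy_form(p0, p1, pi0, pi1):
    """First line of Eq. (2.149)."""
    p = mixture(p0, p1, pi0, pi1)
    return entropy_nats(p) - pi0 * entropy_nats(p0) - pi1 * entropy_nats(p1)


def mutual_information_relative_form(p0, p1, pi0, pi1):
    """Last line of Eq. (2.149)."""
    p = mixture(p0, p1, pi0, pi1)
    return pi0 * kullback_leibler(p0, p) + pi1 * kullback_leibler(p1, p)
```

The two functions should agree to rounding error for any pair of distributions and priors, vanish when the two distributions coincide, and reach $\ln 2$ for perfectly distinguishable distributions with equal priors.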
Chapter 3

The Distinguishability of Quantum States

"... a priori one should expect a chaotic
world which cannot be grasped by the
mind in any way. One could (yes one
should) expect the world to be subjected
to law only to the extent that we order it
through our intelligence."

Albert Einstein
Letter to Maurice Solovine
30 March 1952

3.1 Introduction

For any given quantum state, the outcomes of most possible measurements are completely lawless
in their determination. Quantum theory, however, provides the means for calculating probabilities
for the outcomes. It is by this handle that the measures of distinguishability for probability
distributions can be used to say something about quantum mechanical states. In this Chapter, we
work toward "quantizing" the classical distinguishability measures introduced in Chapter 2. The
problem of statistically distinguishing quantum states $\hat\rho_0$ and $\hat\rho_1$ on a $D$-dimensional Hilbert space
via quantum measurement is that of using some measurement with $n$ outcomes to generate the
probability distributions $p_0(b)$ and $p_1(b)$ used in the classical measures. The number $n$ here should
be thought of as a free variable that remains to be fixed; certainly it need not equal $D$. An optimal
quantum measurement with respect to any of these measures is just a measurement that makes
these quantities as large or as small as they can possibly be. The "quantized" measures are simply
the numerical values of the classical measures when an optimal measurement is used.

The strategy for making progress toward precise expressions of these quantities is to use the
formalism of positive-operator-valued measures or POVMs [^] introduced in Chapter 1. As
a quick reminder, a POVM is a set of positive operators $\hat{E}_b$, which is complete, i.e.,

$$ \langle\psi|\hat{E}_b|\psi\rangle \ge 0 \quad \text{for all } b \text{ and all vectors } |\psi\rangle \tag{3.1} $$

and

$$ \sum_b \hat{E}_b = \hat{1} . \tag{3.2} $$

A consequence of this definition is that the $\hat{E}_b$ are Hermitian operators with nonnegative eigenvalues
[75]. The subscript $b$ here, as before, indexes the possible outcomes of the measurement. Again,
the conditions on the $\hat{E}_b$ are those necessary and sufficient for the expression $p(b) = \mathrm{tr}(\hat\rho \hat{E}_b)$
to be a valid probability distribution for the $b$. As described in Chapter 1, it turns out that a
measurement corresponding to a POVM can always be interpreted as an "ordinary" orthogonal
projection-valued measurement (i.e., one for which the outcomes correspond to eigenvalues of a
Hermitian operator) on an extended system consisting of the given one along with an independently
prepared auxiliary system; the labels $b$ in that interpretation stand for the various outcomes to the
ordinary measurement on the composite system [28].



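The quantum measures of distinguishability we shall focus upon in this chapter are, listed in
order of increasing unwieldiness:

• the Quantum Error Probability

$$ P_e(\hat\rho_0|\hat\rho_1) \equiv \min_{\{\hat{E}_b\}} \sum_b \min\left\{ \pi_0\, \mathrm{tr}(\hat\rho_0 \hat{E}_b),\ \pi_1\, \mathrm{tr}(\hat\rho_1 \hat{E}_b) \right\} \tag{3.3} $$

• the Quantum Fidelity

$$ F(\hat\rho_0, \hat\rho_1) \equiv \min_{\{\hat{E}_b\}} \sum_b \sqrt{\mathrm{tr}(\hat\rho_0 \hat{E}_b)}\, \sqrt{\mathrm{tr}(\hat\rho_1 \hat{E}_b)} \tag{3.4} $$

• the Quantum Rényi Overlaps

$$ F_\alpha(\hat\rho_0/\hat\rho_1) \equiv \min_{\{\hat{E}_b\}} \sum_b \left( \mathrm{tr}(\hat\rho_0 \hat{E}_b) \right)^{\alpha} \left( \mathrm{tr}(\hat\rho_1 \hat{E}_b) \right)^{1-\alpha} , \quad 0 < \alpha < 1 \tag{3.5} $$

• the Quantum Kullback Information

$$ K(\hat\rho_0/\hat\rho_1) \equiv \max_{\{\hat{E}_b\}} \sum_b \mathrm{tr}(\hat\rho_0 \hat{E}_b) \ln\!\left( \frac{\mathrm{tr}(\hat\rho_0 \hat{E}_b)}{\mathrm{tr}(\hat\rho_1 \hat{E}_b)} \right) \tag{3.6} $$

• the Accessible Information

$$ I(\hat\rho_0|\hat\rho_1) \equiv \max_{\{\hat{E}_b\}} \sum_b \left( \pi_0\, \mathrm{tr}(\hat\rho_0 \hat{E}_b) \ln\!\left( \frac{\mathrm{tr}(\hat\rho_0 \hat{E}_b)}{\mathrm{tr}(\hat\rho \hat{E}_b)} \right) + \pi_1\, \mathrm{tr}(\hat\rho_1 \hat{E}_b) \ln\!\left( \frac{\mathrm{tr}(\hat\rho_1 \hat{E}_b)}{\mathrm{tr}(\hat\rho \hat{E}_b)} \right) \right) \tag{3.7} $$

where

$$ \hat\rho = \pi_0 \hat\rho_0 + \pi_1 \hat\rho_1 . \tag{3.8} $$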

Notice again that the number of measurement outcomes in these definitions has not been fixed at
the outset as must be the case in the classical expressions. The notation used here is meant to
convey the following. The comma separating $\hat\rho_0$ and $\hat\rho_1$ in $F(\hat\rho_0, \hat\rho_1)$ is meant to convey that this
function is symmetric upon their interchange. The slash in $F_\alpha(\hat\rho_0/\hat\rho_1)$ and $K(\hat\rho_0/\hat\rho_1)$ signifies
that these functions are explicitly asymmetric in the two density operators. The bar in $P_e(\hat\rho_0|\hat\rho_1)$
and $I(\hat\rho_0|\hat\rho_1)$ is used to emphasize that these may or may not be symmetric, depending upon the
value of the prior probabilities $\pi_0$ and $\pi_1$.

The difficulty that crops up in extremizing quantities like these is that, so far at least, there seems
to be no way to make the problem amenable to a variational approach: the problems associated with
allowing $n$ to be arbitrary while enforcing the constraints on positivity and completeness for the
$\hat{E}_b$ appear to be intractable. Moreover variational techniques generally only lead to the assurance
of local extrema, perhaps never revealing the one that is globally best. New methods are required.

Fortunately, the error-probability and statistical overlap distinguishability measures appear to
be "algebraic" enough that one could well imagine using standard operator inequalities, such as the
Schwarz Inequality for operator inner products, to aid in finding explicit expressions for $P_e(\hat\rho_0|\hat\rho_1)$
and $F(\hat\rho_0, \hat\rho_1)$. That, in fact, is the case. The expression for $F_\alpha(\hat\rho_0/\hat\rho_1)$ appears less tractable, but
one might still hope that something like a Hölder Inequality can be of use in this context. This
remains an open question. On the other hand, when it comes to finding useful expressions for
$K(\hat\rho_0/\hat\rho_1)$ and $I(\hat\rho_0|\hat\rho_1)$, for the same reason, one should be less optimistic. Progress toward explicit
expressions for Eqs. (3.6) and (3.7) is necessarily impeded by the "transcendental" character of
the logarithm appearing in their definitions. Generally only bounds for these quantities may be
found.

This Chapter is devoted to fleshing out what is known about the quantum measures of distin- 
guishability. 

3.2 The Quantum Error Probability 
3.2.1 Single Sample Case 

An interesting particular case of the general quantum decision problem [^] is connected to the 
one introduced in Section 2.2. A given quantum mechanical system is secretly prepared either in
the (pure or mixed) state $\hat\rho_0$ or in the state $\hat\rho_1$. These two possibilities are described by the prior
probabilities $\pi_0$ and $\pi_1$ respectively. It is an observer's task to perform any quantum measurement
he pleases on this system and then to make the "best" possible guess as to the state's true identity.
The word "best" is in quotes because there are many things it can mean — for instance, it may 
depend upon the observer's various personal costs for being right or wrong. Here we shall specialize 
the notion of "best" measurement and "best" guess to be those which, when combined, minimize 
the expected error probability of the decision. The question is this: what quantum measurement 
should the observer use so that his expected probability of error is indeed as small as it can possibly 
be? The answer to this gives rise to an explicit expression for the measure of distinguishability 
called the Quantum Error Probability: 

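$$ P_e(\hat\rho_0|\hat\rho_1) \equiv \min_{\{\hat{E}_b\}} \sum_b \min\left\{ \pi_0\, \mathrm{tr}(\hat\rho_0 \hat{E}_b),\ \pi_1\, \mathrm{tr}(\hat\rho_1 \hat{E}_b) \right\} . \tag{3.9} $$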

This problem can be much simplified by noticing the following. Any "measurement + guess"
the observer can make can be summed up neatly as the measurement of a binary-valued POVM
$\{\hat{E}_0, \hat{E}_1\}$, i.e., two nonnegative definite operators $\hat{E}_0$ and $\hat{E}_1$ such that $\hat{E}_0 + \hat{E}_1 = \hat{1}$. When outcome
0 occurs, the observer chooses the state $\hat\rho_0$; when outcome 1 occurs, he chooses state $\hat\rho_1$. Therefore
the expected probability of error for a decision based on this measurement can be written as

$$ P_e = \pi_0\, \mathrm{tr}(\hat\rho_0 \hat{E}_1) + \pi_1\, \mathrm{tr}(\hat\rho_1 \hat{E}_0) . \tag{3.10} $$

That is to say, the expected probability of error is just the probability that $\hat\rho_0$ is the true state
times the conditional probability that the decision will be wrong when this is the case plus a similar
term for $\hat\rho_1$. So, it must be the case that

$$ \min_{\{\hat{E}_b\}} \sum_b \min\left\{ \pi_0\, \mathrm{tr}(\hat\rho_0 \hat{E}_b),\ \pi_1\, \mathrm{tr}(\hat\rho_1 \hat{E}_b) \right\} = \min_{\{\hat{E}_0, \hat{E}_1\}} \left( \pi_0\, \mathrm{tr}(\hat\rho_0 \hat{E}_1) + \pi_1\, \mathrm{tr}(\hat\rho_1 \hat{E}_0) \right) . \tag{3.11} $$

Helstrom's minimal error-probability measurement is just the POVM $\{\hat{E}_0^{o}, \hat{E}_1^{o}\}$ that minimizes
Eq. (3.10). In Ref. [27] he showed that this POVM possesses the following explicit form. Both $\hat{E}_0^{o}$
and $\hat{E}_1^{o}$ are diagonal in a basis diagonalizing the Hermitian operator

$$ \hat\Gamma = \pi_1 \hat\rho_1 - \pi_0 \hat\rho_0 . \tag{3.12} $$

With respect to this basis, the diagonal elements $\lambda_j^{o}$ of $\hat{E}_0^{o}$ are assigned values according to the
diagonal elements $\gamma_j$ of $\hat\Gamma$ via the rule:

$$ \begin{aligned}
\lambda_j^{o} &= 1 \quad \text{when } \gamma_j < 0 , \\
\lambda_j^{o} &= 0 \quad \text{when } \gamma_j > 0 .
\end{aligned} \tag{3.13} $$

For $j$ such that $\gamma_j = 0$, $\lambda_j^{o}$ may be assigned any value between 0 and 1; we take it to be 0 for
definiteness. The operator $\hat{E}_1^{o}$ is formed simply by working out $\hat{E}_1^{o} = \hat{1} - \hat{E}_0^{o}$.

One way the observer can implement this POVM is just to perform a standard von Neumann
measurement of the Hermitian operator $\hat\Gamma$ and bin the outcomes according to whether they
correspond to positive, negative, or zero eigenvalues. If a positive eigenvalue results, outcome 1 of
the POVM is said to be found and the observer makes a decision appropriately. If a negative
eigenvalue results, outcome 0 of the POVM is said to be found. If a zero eigenvalue results, the
posterior information is that either of the density operators is just as likely as the other. So in that
case, any strategy for a decision will do.

In the remainder of this Section, we shall rederive the explicit form of Helstrom's measurement
in an elementary way that does not depend upon the variational techniques of Ref. [27]. This
derivation is closely connected to the one appearing in Ref. [76].



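Let $\{\hat{E}_0, \hat{E}_1\}$ be an arbitrary binary-valued POVM. Using the fact that

$$ \hat{E}_0 + \hat{E}_1 = \hat{1} , \tag{3.14} $$

Eq. (3.10) becomes

$$ \begin{aligned}
P_e &= \pi_0\, \mathrm{tr}\!\left( \hat\rho_0 (\hat{1} - \hat{E}_0) \right) + \pi_1\, \mathrm{tr}(\hat\rho_1 \hat{E}_0) \\
&= \pi_0\, \mathrm{tr}\hat\rho_0 - \pi_0\, \mathrm{tr}(\hat\rho_0 \hat{E}_0) + \pi_1\, \mathrm{tr}(\hat\rho_1 \hat{E}_0) \\
&= \pi_0 + \mathrm{tr}\!\left( (\pi_1 \hat\rho_1 - \pi_0 \hat\rho_0) \hat{E}_0 \right) .
\end{aligned} \tag{3.15} $$

Therefore finding the minimum of $P_e$ reduces to finding the minimum of $\mathrm{tr}(\hat\Gamma \hat{E}_0)$ over all operators
$\hat{E}_0$ such that $0 \le \hat{E}_0 \le \hat{1}$.

To do this, suppose the operator $\hat\Gamma$ has a spectral decomposition given by

$$ \hat\Gamma = \sum_j \gamma_j |j\rangle\langle j| . \tag{3.16} $$

Then

$$ \mathrm{tr}(\hat\Gamma \hat{E}_0) = \sum_j \gamma_j \langle j|\hat{E}_0|j\rangle . \tag{3.17} $$

Because $\hat\Gamma$ is neither positive- nor negative-definite, this quantity is bounded below by the sum of
its negative terms and, moreover,

$$ \mathrm{tr}(\hat\Gamma \hat{E}_0) \ge {\sum_j}' \gamma_j , \tag{3.18} $$

where the prime on the summation sign signifies that the sum is restricted to those $j$ for which
$\gamma_j < 0$. This follows because $0 \le \langle j|\hat{E}_0|j\rangle \le 1$ for all $j$. Note that the right hand side of Eq. (3.18)
is independent of $\hat{E}_0$. Hence, if we can find any $\hat{E}_0$ that satisfies this inequality via a strict equality,
that POVM element must be optimal.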

To construct such an optimal $\hat{E}_0$, i.e., one that achieves this lower bound ${\sum_j}' \gamma_j$, we may start
by specifying its diagonal elements in this basis:

$$ \begin{aligned}
\langle j|\hat{E}_0^{o}|j\rangle &= 1 \quad \text{when } \gamma_j < 0 , \\
\langle j|\hat{E}_0^{o}|j\rangle &= 0 \quad \text{when } \gamma_j > 0 .
\end{aligned} \tag{3.19} $$

For $j$ such that $\gamma_j = 0$, we may take $\langle j|\hat{E}_0^{o}|j\rangle$ to be any value between 0 and 1; again we take it
to be 0 for definiteness.

Now, since no mention has yet been made of the off-diagonal elements, it might at first appear
that there are many optimal measurements for this problem. That, however, is incorrect; for it
turns out that any measurement operator $\hat{E}_0^{o}$ satisfying Eq. (3.19) must also be diagonal in this
basis. To see this, suppose the operator $\hat{E}_0^{o}$ has the spectral decomposition

$$ \hat{E}_0^{o} = \sum_k e_k |e_k\rangle\langle e_k| . \tag{3.20} $$

Then, first consider a $j$ such that $\gamma_j > 0$. For that,

$$ 0 = \langle j|\hat{E}_0^{o}|j\rangle = \sum_k e_k |\langle e_k|j\rangle|^2 . \tag{3.21} $$

Hence, because $e_k \ge 0$ and $|\langle e_k|j\rangle|^2 \ge 0$ in general, it must be the case that $\langle e_k|j\rangle = 0$ whenever
$e_k \ne 0$. So

$$ \langle k|\hat{E}_0^{o}|j\rangle = \sum_l e_l \langle k|e_l\rangle\langle e_l|j\rangle = 0 , \tag{3.22} $$

for any $k \ne j$ such that $\gamma_j > 0$ or $\gamma_k > 0$. To see that all other off-diagonal terms must vanish, one
just need return to Eq. (3.15) and run through exactly the same argument as above to find that
$\langle k|\hat{E}_1^{o}|j\rangle = 0$ for any $k \ne j$ such that $\gamma_j < 0$ or $\gamma_k < 0$. Then because $\hat{E}_0^{o} + \hat{E}_1^{o} = \hat{1}$, it follows that
$\langle k|\hat{E}_0^{o}|j\rangle = 0$ for all $k \ne j$.

This completes the proof. The measurement operator $\hat{E}_0^{o}$ we have specified is unique up to
the arbitrary choice of the diagonal elements for which $\gamma_j = 0$, though here we have chosen them
to vanish for definiteness. It follows that $\hat{E}_1^{o}$ is also unique to the same extent, i.e., in the basis
diagonalizing $\hat\Gamma$, its $(j,j)$ matrix element is 1 whenever $\gamma_j > 0$; all other matrix elements either
vanish or are set by the condition $\hat{E}_0^{o} + \hat{E}_1^{o} = \hat{1}$.

Thus we have an explicit form for the quantum error probability:

$$ P_e(\hat\rho_0|\hat\rho_1) = \pi_0 + \sum_{\gamma_j < 0} \gamma_j , \tag{3.23} $$

where $\gamma_j$ are eigenvalues of the operator $\hat\Gamma$ defined in Eq. (3.12).
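
A minimal Python sketch of Eq. (3.23) is given below for the simplest nontrivial case, real symmetric 2 x 2 density matrices, where the eigenvalues of $\hat\Gamma$ are available in closed form. The restriction to real 2 x 2 matrices, the tuple representation, and the function names are assumptions made only to keep the illustration self-contained.

```python
import math


def helstrom_error_probability(rho0, rho1, pi0, pi1):
    """Eq. (3.23): pi0 plus the sum of the negative eigenvalues of Gamma.

    rho0 and rho1 are real symmetric 2x2 matrices given as ((a, b), (b, d));
    Gamma = pi1 * rho1 - pi0 * rho0, as in Eq. (3.12).
    """
    a = pi1 * rho1[0][0] - pi0 * rho0[0][0]
    b = pi1 * rho1[0][1] - pi0 * rho0[0][1]
    d = pi1 * rho1[1][1] - pi0 * rho0[1][1]
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    eigenvalues = (mean - radius, mean + radius)
    return pi0 + sum(g for g in eigenvalues if g < 0)


def linear_polarization(theta_degrees):
    """Density matrix of a photon linearly polarized at the given angle."""
    c = math.cos(math.radians(theta_degrees))
    s = math.sin(math.radians(theta_degrees))
    return ((c * c, c * s), (c * s, s * s))
```

For two equally likely pure states with overlap $|\langle\psi_0|\psi_1\rangle|$, the result should reduce to $\frac{1}{2}\left(1 - \sqrt{1 - |\langle\psi_0|\psi_1\rangle|^2}\right)$, which is a convenient check on the implementation.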



3.2.2 Many Sample Case 

What happens to this criterion of distinguishability when there are $M > 1$ copies of the quantum
state upon which measurements can be performed? There are at least two ways it loses its unique
standing in this situation. The first is that one could imagine making a sophisticated measurement
on all $M$ quantum systems at once, i.e., on the $D^M$-dimensional Hilbert space describing the
complete ensemble of quantum states. This measurement is very likely to be more useful than any
set of measurements on the systems separately [77]. In particular, the optimal error-probability
measurement on the big Hilbert space is a Helstrom measurement operator (3.12), except that the
density operators of concern now are $\hat\Upsilon_0 = \hat\rho_0 \otimes \hat\rho_0 \otimes \cdots \otimes \hat\rho_0$ and $\hat\Upsilon_1 = \hat\rho_1 \otimes \hat\rho_1 \otimes \cdots \otimes \hat\rho_1$, where
each expression contains $M$ terms and $\otimes$ is the direct or Kronecker product for matrices [79].
Thus the optimal measurement on the big Hilbert space can be written explicitly as the Hermitian
operator

$$ \hat\Gamma = \pi_1 \bigotimes_{k=1}^{M} \hat\rho_1 - \pi_0 \bigotimes_{k=1}^{M} \hat\rho_0 . \tag{3.24} $$

Clearly the distinguishability $P_e(\hat\Upsilon_0|\hat\Upsilon_1)$ will be no simple function of $P_e(\hat\rho_0|\hat\rho_1)$.

A second way for $P_e(\hat\rho_0|\hat\rho_1)$ to lose its unique standing in the decision problem comes about
even when all the measurements are restricted to the individual quantum systems. For instance,
suppose that $\hat\rho_0$ and $\hat\rho_1$ are equally probable pure linear polarization states of a photon, one along
the horizontal and the other 45° from the horizontal. The optimal error-probability measurement
for the case $M = 1$ is given by the Helstrom operator $\hat\Gamma$ just derived, i.e., the measurement of the
yes/no question of whether the photon is polarized 67.50° from the horizontal. On the other hand,
if $M = 2$, the expression that must be optimized over all (Hermitian operator) measurements is no
longer Eq. (3.10), but rather

$$ \begin{aligned}
P_e = {} & \tfrac{1}{2} \min\{ p_0(\uparrow) p_0(\uparrow),\ p_1(\uparrow) p_1(\uparrow) \} + \min\{ p_0(\uparrow) p_0(\downarrow),\ p_1(\uparrow) p_1(\downarrow) \} \\
& + \tfrac{1}{2} \min\{ p_0(\downarrow) p_0(\downarrow),\ p_1(\downarrow) p_1(\downarrow) \} .
\end{aligned} \tag{3.25} $$

This reflects the fact that this experiment has four possible outcomes: $\uparrow\uparrow$, $\uparrow\downarrow$, $\downarrow\uparrow$, and $\downarrow\downarrow$, with
$\uparrow$ and $\downarrow$ denoting yes and no outcomes, respectively. The measurement that minimizes Eq. (3.25)
can be found easily by numerical means; it turns out to be a polarization measurement 54.54°
from the horizontal. In similar fashion, if $M = 3$, the optimal measurement is along the axis 49.94°
from the horizontal. See Fig. 3.1. The lesson to be learned from this is that the optimal
error-probability measurement is expressly dependent upon the number of repetitions $M$ expected. This
phenomenon has already been encountered in the classical example of Chapter 2. If $M$ is to be
left undetermined, then something beside the minimal error probability itself is required for an
adequate measure of statistical distinguishability within the context of the decision problem.
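
The numerical minimization mentioned above can be sketched as a simple grid search. The code below generalizes Eq. (3.25) to $M$ repetitions of the same yes/no polarization measurement by grouping outcome strings according to their number of "yes" results; the search window of 45° to 70° and the 0.01° step are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def error_probability(theta_degrees, M, pi0=0.5, pi1=0.5):
    """Error probability after M identical yes/no polarization measurements.

    Hypothesis 0 is horizontal polarization; hypothesis 1 is polarization at
    45 degrees. The analyzer sits at theta_degrees from the horizontal, and
    after the M outcomes the more probable hypothesis is guessed.
    """
    a = math.cos(math.radians(theta_degrees)) ** 2          # p0(yes)
    b = math.cos(math.radians(theta_degrees - 45.0)) ** 2   # p1(yes)
    total = 0.0
    for k in range(M + 1):                                   # k = number of "yes" outcomes
        w0 = pi0 * a ** k * (1 - a) ** (M - k)
        w1 = pi1 * b ** k * (1 - b) ** (M - k)
        total += math.comb(M, k) * min(w0, w1)
    return total


def best_angle(M, start=45.0, stop=70.0, step=0.01):
    """Grid search for the analyzer angle (in degrees) minimizing the error."""
    n = int(round((stop - start) / step))
    angles = [start + i * step for i in range(n + 1)]
    return min(angles, key=lambda t: error_probability(t, M))
```

Running `best_angle(M)` for $M = 1, 2, 3$ gives a direct way to compare against the angles quoted in the text; the symmetry of the problem means an equivalent optimum also exists outside the chosen search window.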



3.3 The Quantum Fidelity 

The solution of Section 2.2.2 to remedy this predicament was to shift focus to the optimal
exponential decrease in error probability in the number of samples $M$. Translated into the quantum
context, that would mean we should use the Quantum Chernoff Bound

$$ C(\hat\rho_0, \hat\rho_1) \equiv \min_{0 \le \alpha \le 1}\ \min_{\{\hat{E}_b\}} \sum_b \left( \mathrm{tr}(\hat\rho_0 \hat{E}_b) \right)^{\alpha} \left( \mathrm{tr}(\hat\rho_1 \hat{E}_b) \right)^{1-\alpha} \tag{3.26} $$

as the appropriate measure of distinguishability. Instead, in this Section we shall focus on optimizing
a particular upper bound to this measure of distinguishability, the statistical overlap introduced in
Section 2.2.2.

The "quantized" version of the statistical overlap is called the quantum fidelity and is defined
by

$$ F(\hat\rho_0, \hat\rho_1) \equiv \min_{\{\hat{E}_b\}} \sum_b \sqrt{\mathrm{tr}(\hat\rho_0 \hat{E}_b)}\, \sqrt{\mathrm{tr}(\hat\rho_1 \hat{E}_b)} . \tag{3.27} $$

We shall show in this Section that

$$ F(\hat\rho_0, \hat\rho_1) = \mathrm{tr}\sqrt{ \hat\rho_1^{1/2} \hat\rho_0\, \hat\rho_1^{1/2} } , \tag{3.28} $$

where for any nonnegative operator $\hat{A}$ we mean by $\hat{A}^{1/2}$ (or $\sqrt{\hat{A}}$) the unique nonnegative operator
such that $\hat{A}^{1/2} \hat{A}^{1/2} = \hat{A}$.

Figure 3.1: Probability of error in guessing a photon's polarization that is either horizontal or 45°
from the horizontal. Error probability is plotted here as a function of measurement (i.e., radians
from the horizontal) and number of measurement repetitions M before the guess is made.
(Horizontal axis: 2 × Polarizer Angle, in radians.)

The quantity on the right hand side of Eq. (3.28) has appeared before as the distance function

$$ d_B^2(\hat\rho_0, \hat\rho_1) = 2 - 2F(\hat\rho_0, \hat\rho_1) \tag{3.29} $$

of Bures [30, 81], the generalized transition probability for mixed states

$$ \mathrm{prob}(\hat\rho_0 \to \hat\rho_1) = \left( F(\hat\rho_0, \hat\rho_1) \right)^2 \tag{3.30} $$



of Uhlmann 0, and — in the same form as Uhlmann's — Jozsa's criterion |^] for "fidelity" of signals 
in a quantum communication channel. Note that Jozsa's "fidelity" [^] is actually the square of the 
quantity called fidelity here. The fidelity was found to be of use within that context because it is 
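symmetric in 0 and 1, because it is invariant under unitary operations, i.e.,

$$ F(\hat{U} \hat\rho_0 \hat{U}^\dagger, \hat{U} \hat\rho_1 \hat{U}^\dagger) = F(\hat\rho_0, \hat\rho_1) , \tag{3.31} $$

for any unitary operator $\hat{U}$, because

$$ 0 \le F(\hat\rho_0, \hat\rho_1) \le 1 \tag{3.32} $$

reaching 1 if and only if $\hat\rho_0 = \hat\rho_1$, and because

$$ \left( F(\hat\rho_0, \hat\rho_1) \right)^2 = \langle\psi_1|\hat\rho_0|\psi_1\rangle \tag{3.33} $$

when $\hat\rho_1 = |\psi_1\rangle\langle\psi_1|$ is a pure state.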

The notion defined by Eq. (3.30) is particularly significant because of the auxiliary interpretation
it gives the quantity in Eq. (3.28). Imagine another system B, described by a $D$-dimensional
Hilbert space, attached to our given system. There are many pure states $|\psi_0\rangle$ and $|\psi_1\rangle$ on the
composite system such that

$$ \mathrm{tr}_B(|\psi_0\rangle\langle\psi_0|) = \hat\rho_0 \quad \text{and} \quad \mathrm{tr}_B(|\psi_1\rangle\langle\psi_1|) = \hat\rho_1 , \tag{3.34} $$

where $\mathrm{tr}_B$ denotes a partial trace over System B's Hilbert space. Such pure states are called
"purifications" of the density operators po and pi. For these, the following theorem can be shown 

Si 

Theorem 3.1 (Uhlmann) For all purifications $|\psi_0\rangle$ and $|\psi_1\rangle$ of $\hat\rho_0$ and $\hat\rho_1$, respectively,

$$ |\langle\psi_0|\psi_1\rangle| \le \mathrm{tr}\sqrt{ \hat\rho_1^{1/2} \hat\rho_0\, \hat\rho_1^{1/2} } . \tag{3.35} $$

Moreover, equality is achievable in this expression by an appropriate choice of purifications.

That is to say, of all purifications of $\hat\rho_0$ and $\hat\rho_1$, the ones with the maximal modulus for their inner
product have it actually equal to the quantum fidelity as defined here.
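
The claim that the measurement-optimized statistical overlap of Eq. (3.27) equals the closed form of Eq. (3.28) can be explored numerically for a single qubit. The sketch below relies on two assumptions that are not part of the text: the density matrices are taken to be real and symmetric, so that a search over real projective measurements suffices, and the trace of the square root is evaluated with the two-dimensional identity $\mathrm{tr}\sqrt{M} = \sqrt{\mathrm{tr}\,M + 2\sqrt{\det M}}$. All function names are hypothetical.

```python
import math


def qubit_fidelity(rho0, rho1):
    """tr sqrt(rho1^(1/2) rho0 rho1^(1/2)) for real symmetric 2x2 density matrices.

    With M = rho1^(1/2) rho0 rho1^(1/2), one has tr M = tr(rho0 rho1) and
    det M = det(rho0) det(rho1), and tr sqrt(M) = sqrt(tr M + 2 sqrt(det M)).
    """
    trace_product = (rho0[0][0] * rho1[0][0] + 2 * rho0[0][1] * rho1[0][1]
                     + rho0[1][1] * rho1[1][1])
    det0 = rho0[0][0] * rho0[1][1] - rho0[0][1] ** 2
    det1 = rho1[0][0] * rho1[1][1] - rho1[0][1] ** 2
    return math.sqrt(trace_product + 2 * math.sqrt(max(det0 * det1, 0.0)))


def statistical_overlap(rho0, rho1, theta):
    """sum_b sqrt(p0(b) p1(b)) for the projective measurement along angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)

    def p_yes(rho):
        return rho[0][0] * c * c + 2 * rho[0][1] * c * s + rho[1][1] * s * s

    p0, p1 = p_yes(rho0), p_yes(rho1)
    return math.sqrt(max(p0 * p1, 0.0)) + math.sqrt(max((1 - p0) * (1 - p1), 0.0))


def minimal_overlap(rho0, rho1, steps=18000):
    """Smallest statistical overlap over a grid of real projective measurements."""
    return min(statistical_overlap(rho0, rho1, math.pi * i / steps) for i in range(steps))
```

For a pair of full-rank real qubit states, `minimal_overlap` should approach `qubit_fidelity` from above as the grid is refined, consistent with the fidelity being a minimum over measurements.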

We should note that, in a roundabout way through the mathematical-physics literature (cf., for 
instance, in logical order |^], |82], [33|, |84], and | [^), one can put together a result quite similar in 
spirit to Eq. (|]2|)— that is, a maximization like ( |3.4D but, instead of over all POVMs, restricted to 
orthogonal projection- valued measures. What is novel here is the explicit statistical interpretation, 
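the simplicity and generality of the derivation, and the fact that it pinpoints the measurement by
which Eq. (3.28) is attained.

3.3.1 The General Derivation

Before getting started, let us note that if $\hat\rho_0 = |\psi_0\rangle\langle\psi_0|$ and $\hat\rho_1 = |\psi_1\rangle\langle\psi_1|$ are pure states, the
expression in Eq. (3.28) reduces to

$$ \begin{aligned}
\mathrm{tr}\sqrt{ |\psi_1\rangle\langle\psi_1|\psi_0\rangle\langle\psi_0|\psi_1\rangle\langle\psi_1| } &= |\langle\psi_0|\psi_1\rangle|\ \mathrm{tr}\sqrt{ |\psi_1\rangle\langle\psi_1| } \\
&= |\langle\psi_0|\psi_1\rangle| .
\end{aligned} \tag{3.36} $$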



This already agrees with the expression derived by Wootters ||85|, 45, H] for the optimal statistical 



overlap between pure states. Moreover, it indicates that Eq. (3.28) has a chance of being a general 
solution to Eq. (|]2^). 



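The method we use for deriving Eq. (3.28) is to apply the Schwarz inequality to the statistical 
overlap in such a way that its specific conditions for equality can be met by a suitable measurement. 
First, however, it is instructive to consider a quick and dirty (and for this problem inappropriate) 
application of the Schwarz inequality; the difficulties encountered therein point naturally toward 
the correct proof. The Schwarz inequality for the operator inner product $\mathrm{tr}(\hat A^\dagger\hat B)$ is given by 

$$ |\mathrm{tr}(\hat A^\dagger\hat B)|^2 \le \mathrm{tr}(\hat A^\dagger\hat A)\,\mathrm{tr}(\hat B^\dagger\hat B) , \tag{3.37} $$

where equality is achieved if and only if $\hat B = \mu\hat A$ for some constant $\mu$. 
Let $\{\hat E_b\}$ be an arbitrary POVM and 

$$ p_0(b) = \mathrm{tr}(\hat\rho_0\hat E_b) \quad\text{and}\quad p_1(b) = \mathrm{tr}(\hat\rho_1\hat E_b) . \tag{3.38} $$

By the cyclic property of the trace and this inequality, we must have for any b, 

$$ \begin{aligned}
\sqrt{p_0(b)}\sqrt{p_1(b)} &= \sqrt{\mathrm{tr}\bigl(\hat\rho_0^{1/2}\hat E_b\hat\rho_0^{1/2}\bigr)}\,\sqrt{\mathrm{tr}\bigl(\hat\rho_1^{1/2}\hat E_b\hat\rho_1^{1/2}\bigr)} \\
&= \sqrt{\mathrm{tr}\bigl((\hat E_b^{1/2}\hat\rho_0^{1/2})^\dagger(\hat E_b^{1/2}\hat\rho_0^{1/2})\bigr)}\,\sqrt{\mathrm{tr}\bigl((\hat E_b^{1/2}\hat\rho_1^{1/2})^\dagger(\hat E_b^{1/2}\hat\rho_1^{1/2})\bigr)} \\
&\ge \bigl|\mathrm{tr}\bigl((\hat E_b^{1/2}\hat\rho_0^{1/2})^\dagger(\hat E_b^{1/2}\hat\rho_1^{1/2})\bigr)\bigr| \\
&= \bigl|\mathrm{tr}\bigl(\hat\rho_0^{1/2}\hat E_b\hat\rho_1^{1/2}\bigr)\bigr| .
\end{aligned} \tag{3.39} $$

The condition for attaining equality here is that 

$$ \hat E_b^{1/2}\hat\rho_1^{1/2} = \mu_b\,\hat E_b^{1/2}\hat\rho_0^{1/2} . \tag{3.40} $$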



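A subscript b has been placed on the constant $\mu$ as a reminder of its dependence on the particular $\hat E_b$ 
in this equation. From inequality (3.39), it follows by the linearity of the trace and the completeness 
property of POVMs that 

$$ \sum_b \sqrt{p_0(b)}\sqrt{p_1(b)} \ge \sum_b \bigl|\mathrm{tr}\bigl(\hat\rho_0^{1/2}\hat E_b\hat\rho_1^{1/2}\bigr)\bigr| \ge \Bigl|\sum_b \mathrm{tr}\bigl(\hat\rho_0^{1/2}\hat E_b\hat\rho_1^{1/2}\bigr)\Bigr| \tag{3.41} $$

$$ = \bigl|\mathrm{tr}\bigl(\hat\rho_0^{1/2}\hat\rho_1^{1/2}\bigr)\bigr| . \tag{3.42} $$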
The quantity 

$$ F_A(\hat\rho_0,\hat\rho_1) = \mathrm{tr}\bigl(\hat\rho_0^{1/2}\hat\rho_1^{1/2}\bigr) \tag{3.43} $$

is thus a lower bound to $F(\hat\rho_0,\hat\rho_1)$. For it actually to be the minimum, there must be a POVM 
such that, for all b, Eq. (3.40) is satisfied and 

$$ \mathrm{tr}\bigl(\hat\rho_0^{1/2}\hat E_b\hat\rho_1^{1/2}\bigr) = \bigl|\mathrm{tr}\bigl(\hat\rho_0^{1/2}\hat E_b\hat\rho_1^{1/2}\bigr)\bigr|\,e^{i\phi} , \tag{3.44} $$

where $\phi$ is an arbitrary phase independent of b, so that the sum can be taken past the absolute 
values sign in Eq. (3.41) without effect. 
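
A quick numerical experiment makes the gap between the bound (3.43) and the quantity $\mathrm{tr}\sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}}$ of Eq. (3.28) concrete. This is an illustrative sketch assuming NumPy; `naive_bound` and the other helper names are hypothetical. For commuting (here diagonal) states the two quantities coincide, while for generic noncommuting states the naive bound is strictly smaller.

```python
import numpy as np

def psd_sqrt(a):
    """Square root of a positive semidefinite Hermitian matrix (via eigh)."""
    w, v = np.linalg.eigh((a + a.conj().T) / 2)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T

def fidelity(rho0, rho1):
    s = psd_sqrt(rho1)
    return float(np.trace(psd_sqrt(s @ rho0 @ s)).real)

def naive_bound(rho0, rho1):
    """F_A = tr(rho0^{1/2} rho1^{1/2}), Eq. (3.43)."""
    return float(np.trace(psd_sqrt(rho0) @ psd_sqrt(rho1)).real)

def random_density(d, rng):
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(2)
pairs = [(random_density(3, rng), random_density(3, rng)) for _ in range(100)]
gaps = [fidelity(a, b) - naive_bound(a, b) for a, b in pairs]

# Commuting example: both states diagonal in the same basis.
p = np.diag([0.5, 0.3, 0.2])
q = np.diag([0.1, 0.6, 0.3])
commuting_gap = fidelity(p, q) - naive_bound(p, q)
```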

These conditions, however, cannot be fulfilled by any POVM $\{\hat E_b\}$ when $\hat\rho_0$ and $\hat\rho_1$ do not 
commute. This can be seen as follows. Suppose $[\hat\rho_0,\hat\rho_1] \ne 0$ and, for simplicity, let us suppose that 
$\hat\rho_0$ can be inverted. Then condition (3.40) can be written equivalently as 

$$ \hat E_b^{1/2}\bigl(\mu_b - \hat\rho_1^{1/2}\hat\rho_0^{-1/2}\bigr) = 0 . \tag{3.45} $$

The only way this can be satisfied is if we take the $\hat E_b$ to be proportional to the projectors formed 
from the left-eigenvectors of $\hat\rho_1^{1/2}\hat\rho_0^{-1/2}$ and let the $\mu_b$ be the corresponding eigenvalues. This is seen 
easily. 

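The operator $\hat\rho_1^{1/2}\hat\rho_0^{-1/2}$ is non-Hermitian by assumption. Thus, though it has D linearly 
independent left- and right-eigenvectors, they cannot be orthogonal. Let us denote the left-eigenvectors 
by $\langle\psi_r|$ and their corresponding eigenvalues by $\sigma_r$; let us denote the right-eigenvectors and 
eigenvalues by $|\phi_q\rangle$ and $\lambda_q$. Then if Eq. (3.45) is to hold, we must have 

$$ \hat E_b^{1/2}\bigl(\mu_b - \hat\rho_1^{1/2}\hat\rho_0^{-1/2}\bigr)|\phi_q\rangle = 0 . \tag{3.46} $$

It follows that 

$$ (\mu_b - \lambda_q)\,\hat E_b^{1/2}|\phi_q\rangle = 0 \tag{3.47} $$

for all q and b. Now assume, again for simplicity, that all the $\lambda_q$ are distinct. If $\hat E_b$ is not 
identically zero, then we must have that (modulo relabeling) 

$$ \hat E_b^{1/2}|\phi_q\rangle = 0 \ \text{ for all } q \ne b \quad\text{and}\quad \mu_b = \lambda_q \ \text{ for } q = b . \tag{3.48} $$

This means that $\hat E_b^{1/2}$ is proportional to the projector onto the one-dimensional subspace that is 
orthogonal to all the $|\phi_q\rangle$ with $q \ne b$. But since 

$$ \begin{aligned}
0 &= \langle\psi_r|\hat\rho_1^{1/2}\hat\rho_0^{-1/2}|\phi_q\rangle - \langle\psi_r|\hat\rho_1^{1/2}\hat\rho_0^{-1/2}|\phi_q\rangle \\
&= (\sigma_r - \lambda_q)\langle\psi_r|\phi_q\rangle ,
\end{aligned} \tag{3.49} $$

we have that (again modulo relabeling) $|\psi_r\rangle$ is orthogonal to $|\phi_q\rangle$ for $q \ne r$ and $\sigma_r = \lambda_q$ for $q = r$. 
Therefore 

$$ \hat E_b^{1/2} \propto |\psi_b\rangle\langle\psi_b| . \tag{3.50} $$

The reason Eq. (3.45) cannot be satisfied by any POVM is now apparent; it is just that the $|\psi_b\rangle$ are 
nonorthogonal. When the $|\psi_b\rangle$ are nonorthogonal, there are no positive constants $\alpha_b$ ($b = 1, \ldots, n$) 
such that 

$$ \sum_{b=1}^{n} \alpha_b\,|\psi_b\rangle\langle\psi_b| = \hat 1 . \tag{3.51} $$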
For if there were, then the completeness relation would give rise to the equation 

$$ \sum_{b=1}^{n} \alpha_b\,|\psi_b\rangle\langle\psi_b|\psi_c\rangle = |\psi_c\rangle , \tag{3.52} $$

so that 

$$ \sum_{b \ne c} \alpha_b\,\langle\psi_b|\psi_c\rangle\,|\psi_b\rangle = (1 - \alpha_c)\,|\psi_c\rangle , \tag{3.53} $$

contradicting the fact that the $|\psi_b\rangle$ are linearly independent but nonorthogonal. If $\hat\rho_0$ and $\hat\rho_1$ were 
commuting operators so that $\hat\rho_1^{1/2}\hat\rho_0^{-1/2}$ were Hermitian, there would be no problem; for then all the 
eigenvectors would be mutually orthogonal. A complete set of orthonormal projectors necessarily 
sums to the identity operator. 

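The lesson from this example is that the naive Schwarz inequality is not enough to prove 
Eq. (3.28); one must be careful to "build in" a way to attain equality by at least one POVM. 
Plainly the way to do this is to take advantage of the invariances of the trace operation. In 
particular, in the set of inequalities (3.39), we could just as well have written 

$$ p_0(b) = \mathrm{tr}(\hat\rho_0\hat E_b) = \mathrm{tr}\bigl(\hat U\hat\rho_0^{1/2}\hat E_b\hat\rho_0^{1/2}\hat U^\dagger\bigr) \tag{3.54} $$

for any unitary operator $\hat U$ since $\hat U^\dagger\hat U = \hat 1$. Then, in exact analogy to the previous derivation, it 
follows that 

$$ \begin{aligned}
\sqrt{p_0(b)}\sqrt{p_1(b)} &= \sqrt{\mathrm{tr}\bigl((\hat E_b^{1/2}\hat\rho_0^{1/2}\hat U^\dagger)^\dagger(\hat E_b^{1/2}\hat\rho_0^{1/2}\hat U^\dagger)\bigr)}\,\sqrt{\mathrm{tr}\bigl((\hat E_b^{1/2}\hat\rho_1^{1/2})^\dagger(\hat E_b^{1/2}\hat\rho_1^{1/2})\bigr)} \\
&\ge \bigl|\mathrm{tr}\bigl((\hat E_b^{1/2}\hat\rho_0^{1/2}\hat U^\dagger)^\dagger(\hat E_b^{1/2}\hat\rho_1^{1/2})\bigr)\bigr| \\
&= \bigl|\mathrm{tr}\bigl(\hat U\hat\rho_0^{1/2}\hat E_b\hat\rho_1^{1/2}\bigr)\bigr| ,
\end{aligned} \tag{3.55} $$

where the condition for equality is now 

$$ \hat E_b^{1/2}\hat\rho_1^{1/2} = \mu_b\,\hat E_b^{1/2}\hat\rho_0^{1/2}\hat U^\dagger . \tag{3.56} $$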



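This equation, it turns out, can be satisfied by an appropriate choice for the unitary operator $\hat U$. 
To see this, let us first suppose that $\hat\rho_0$ and $\hat\rho_1$ are invertible. Then Eq. (3.56) is equivalent to 

$$ \hat E_b^{1/2}\bigl(\hat 1 - \mu_b\,\hat\rho_0^{1/2}\hat U^\dagger\hat\rho_1^{-1/2}\bigr) = 0 . \tag{3.57} $$

Summing Eq. (3.55) on b, we get 

$$ \sum_b \sqrt{p_0(b)}\sqrt{p_1(b)} \ge \sum_b \bigl|\mathrm{tr}\bigl(\hat U\hat\rho_0^{1/2}\hat E_b\hat\rho_1^{1/2}\bigr)\bigr| \ge \bigl|\mathrm{tr}\bigl(\hat U\hat\rho_0^{1/2}\hat\rho_1^{1/2}\bigr)\bigr| . \tag{3.58} $$

The final conditions for equality in this are that the $\hat E_b$ satisfy both Eq. (3.57) and the requirement 

$$ \mathrm{tr}\bigl(\hat U\hat\rho_0^{1/2}\hat E_b\hat\rho_1^{1/2}\bigr) = \bigl|\mathrm{tr}\bigl(\hat U\hat\rho_0^{1/2}\hat E_b\hat\rho_1^{1/2}\bigr)\bigr|\,e^{i\phi} \tag{3.59} $$

for all b, where again $\phi$ is an arbitrary phase. 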
As in the last example, there can be no POVM $\{\hat E_b\}$ that satisfies condition (3.57) unless the 
operator $\hat\rho_0^{1/2}\hat U^\dagger\hat\rho_1^{-1/2}$ is Hermitian. An easy way to find a unitary $\hat U$ that makes a solution to 
Eq. (3.57) possible is to note a completely different point about inequality (3.58). The unitary 
operator $\hat U$ there is arbitrary; if there is to be a chance of attaining equality in (3.58), $\hat U$ had better 
be chosen so as to maximize $\bigl|\mathrm{tr}\bigl(\hat U\hat\rho_0^{1/2}\hat\rho_1^{1/2}\bigr)\bigr|$. It turns out that that particular $\hat U$ forces 
$\hat\rho_0^{1/2}\hat U^\dagger\hat\rho_1^{-1/2}$ to be Hermitian. 

To demonstrate the last point, we need a result from the mathematical literature [87, 75]: 
for any operator $\hat A$, 

$$ \max_{\hat U}\,\bigl|\mathrm{tr}(\hat U\hat A)\bigr| = \max_{\hat U}\,\mathrm{Re}\bigl[\mathrm{tr}(\hat U\hat A)\bigr] = \mathrm{tr}\sqrt{\hat A^\dagger\hat A} , \tag{3.60} $$

where the maximum is taken over all unitary operators $\hat U$. The set of operators $\hat U$ that gives rise 
to the maximum must satisfy 

$$ \hat U\hat A = \sqrt{\hat A^\dagger\hat A} . \tag{3.61} $$

At least one such unitary operator is assured to exist by the so-called polar decomposition theorem. 
When $\hat A$ is invertible, it is easy to see that 

$$ \hat U = \sqrt{\hat A^\dagger\hat A}\,\hat A^{-1} \tag{3.62} $$

has the desired properties and is unique. When $\hat A$ is not invertible, $\hat U$ is no longer unique, but can 
still be shown to exist [79, pp. 74-75]. 
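
The content of Eqs. (3.60) through (3.62) can be checked on a random matrix. The sketch below is illustrative only (NumPy assumed, helper names hypothetical): it forms $\hat U = \sqrt{\hat A^\dagger\hat A}\,\hat A^{-1}$, confirms that it is unitary and attains $\mathrm{tr}\sqrt{\hat A^\dagger\hat A}$ (the sum of the singular values of $\hat A$), and samples random unitaries to see that none does better.

```python
import numpy as np

def psd_sqrt(a):
    """Square root of a positive semidefinite Hermitian matrix (via eigh)."""
    w, v = np.linalg.eigh((a + a.conj().T) / 2)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T

def random_unitary(d, rng):
    q, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q

rng = np.random.default_rng(3)
d = 3
a = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))  # generic invertible operator

abs_a = psd_sqrt(a.conj().T @ a)        # sqrt(A^dagger A)
u_opt = abs_a @ np.linalg.inv(a)        # Eq. (3.62)
bound = float(np.trace(abs_a).real)     # right-hand side of Eq. (3.60)

attained = np.trace(u_opt @ a)
sampled = [abs(np.trace(random_unitary(d, rng) @ a)) for _ in range(500)]
singular_value_sum = float(np.linalg.svd(a, compute_uv=False).sum())
```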



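So a unitary operator $\hat U_c$ that gives rise to the tightest inequality in Eq. (3.58) can be taken to 
satisfy 

$$ \hat U_c\,\hat\rho_0^{1/2}\hat\rho_1^{1/2} = \sqrt{\bigl(\hat\rho_0^{1/2}\hat\rho_1^{1/2}\bigr)^\dagger\bigl(\hat\rho_0^{1/2}\hat\rho_1^{1/2}\bigr)} = \sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}} . \tag{3.63} $$

When $\hat\rho_0$ and $\hat\rho_1$ are both invertible, this equation is uniquely satisfied by 

$$ \hat U_c = \sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}}\;\hat\rho_1^{-1/2}\hat\rho_0^{-1/2} . \tag{3.64} $$

With this, Eq. (3.58) clearly takes the form needed to prove Eq. (3.27): 

$$ \sum_b \sqrt{p_0(b)}\sqrt{p_1(b)} \ge \mathrm{tr}\sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}} . \tag{3.65} $$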



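Inserting this choice for $\hat U$ into Eq. (3.57) gives the condition 

$$ \hat E_b^{1/2}\bigl(\hat 1 - \mu_b\hat M\bigr) = 0 , \tag{3.66} $$

where the operator 

$$ \hat M = \hat\rho_0^{1/2}\hat U_c^\dagger\hat\rho_1^{-1/2} = \hat\rho_1^{-1/2}\sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}}\;\hat\rho_1^{-1/2} \tag{3.67} $$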
is indeed Hermitian and also nonnegative (as can be seen immediately from its symmetry). Thus 
there is a POVM $\{\hat E_b^F\}$ that satisfies Eq. (3.57) for each b: the $\hat E_b^F$ can be taken to be projectors onto 
a basis $|b\rangle$ that diagonalizes $\hat M$. Here the $\mu_b$ must be taken to be reciprocals of $\hat M$'s eigenvalues. 
With the POVM $\{\hat E_b^F\}$, Eq. (3.59) is automatically satisfied. Since the eigenvalues $1/\mu_b$ of $\hat M$ 
are all nonnegative, one finds that 

$$ \mathrm{tr}\bigl(\hat U_c\hat\rho_0^{1/2}\hat E_b^F\hat\rho_1^{1/2}\bigr) = \mathrm{tr}\bigl(\hat\rho_1\hat M\hat E_b^F\bigr) = \frac{1}{\mu_b}\,\mathrm{tr}\bigl(\hat\rho_1\hat E_b^F\bigr) \ge 0 . \tag{3.68} $$

This concludes the proof of Eq. (3.28) under the restriction that $\hat\rho_0$ and $\hat\rho_1$ be invertible. 
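
The invertible case can be checked end to end on random states. The following is a minimal sketch, assuming NumPy and full-rank density matrices; the function names are hypothetical. It builds $\hat M = \hat\rho_1^{-1/2}\sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}}\,\hat\rho_1^{-1/2}$, measures in its eigenbasis, and compares the resulting statistical overlap with $\mathrm{tr}\sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}}$ and with randomly chosen orthonormal bases.

```python
import numpy as np

def psd_sqrt(a):
    """Square root of a positive semidefinite Hermitian matrix (via eigh)."""
    w, v = np.linalg.eigh((a + a.conj().T) / 2)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T

def fidelity(rho0, rho1):
    s = psd_sqrt(rho1)
    return float(np.trace(psd_sqrt(s @ rho0 @ s)).real)

def random_density(d, rng, mix=0.2):
    """Random density matrix, mixed with the identity to keep it well conditioned."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    rho = rho / np.trace(rho).real
    return (1 - mix) * rho + mix * np.eye(d) / d

def random_unitary(d, rng):
    q, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q

def optimal_operator(rho0, rho1):
    """M = rho1^{-1/2} sqrt(rho1^{1/2} rho0 rho1^{1/2}) rho1^{-1/2}, as in Eq. (3.67)."""
    s = psd_sqrt(rho1)
    s_inv = np.linalg.inv(s)
    m = s_inv @ psd_sqrt(s @ rho0 @ s) @ s_inv
    return (m + m.conj().T) / 2

def outcome_probs(rho, basis):
    """p(b) = <b|rho|b> for the orthonormal columns |b> of basis."""
    return np.einsum('ib,ij,jb->b', basis.conj(), rho, basis).real

def statistical_overlap(rho0, rho1, basis):
    p0, p1 = outcome_probs(rho0, basis), outcome_probs(rho1, basis)
    return float(np.sum(np.sqrt(np.clip(p0 * p1, 0.0, None))))

rng = np.random.default_rng(4)
d = 3
rho0, rho1 = random_density(d, rng), random_density(d, rng)

m = optimal_operator(rho0, rho1)
_, optimal_basis = np.linalg.eigh(m)
overlap_optimal = statistical_overlap(rho0, rho1, optimal_basis)
overlaps_random = [statistical_overlap(rho0, rho1, random_unitary(d, rng))
                   for _ in range(300)]
f = fidelity(rho0, rho1)
```

Only projective measurements are sampled here, so the comparison illustrates the claim without testing it over all POVMs.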



When $\hat\rho_0$ and/or $\hat\rho_1$ are not invertible, things are only slightly more difficult. Suppose $\hat\rho_1$ is not 
invertible; then there exists a projector $\hat\Pi_{\rm null}$ onto the null subspace of $\hat\rho_1$, i.e., $\hat\Pi_{\rm null}$ is a projector 
of maximal rank such that 

$$ \hat\Pi_{\rm null}\,\hat\rho_1 = 0 \quad\text{and}\quad \hat\rho_1\,\hat\Pi_{\rm null} = 0 . \tag{3.69} $$

The projector onto the support of $\hat\rho_1$, i.e., the orthogonal complement to the null subspace, is 
defined by 

$$ \hat\Pi_{\rm supp} = \hat 1 - \hat\Pi_{\rm null} . \tag{3.70} $$

Clearly $\hat E_{\rm null} = \hat\Pi_{\rm null}$ satisfies Eq. (3.56) if the associated constant $\mu_{\rm null}$ is chosen to be zero. Now 
let us construct a set of orthogonal projectors $\hat E_b = |b\rangle\langle b|$ that span the support of $\hat\rho_1$ and satisfy 
Eqs. (3.56) and (3.59) with (3.63). This is done easily enough. Suppose the support of $\hat\rho_1$ is an 
m-dimensional subspace; the operator 

$$ \hat R_1 = \hat\Pi_{\rm supp}\,\hat\rho_1\,\hat\Pi_{\rm supp} \tag{3.71} $$

is invertible on that subspace. Now consider any set of m orthogonal one-dimensional projectors 
$\hat\Pi_b$ in the support of $\hat\rho_1$. If they are to satisfy Eq. (3.56), then they must also satisfy the equations 
created by sandwiching Eqs. (3.56) and (3.63) by the projector $\hat\Pi_{\rm supp}$, 

$$ \hat\Pi_b\hat R_1^{1/2} = \mu_b\,\hat\Pi_b\bigl(\hat\Pi_{\rm supp}\,\hat\rho_0^{1/2}\hat U_c^\dagger\,\hat\Pi_{\rm supp}\bigr) , \tag{3.72} $$

and 

$$ \bigl(\hat\Pi_{\rm supp}\,\hat U_c\hat\rho_0^{1/2}\,\hat\Pi_{\rm supp}\bigr)\hat R_1^{1/2} = \hat\Pi_{\rm supp}\sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}}\;\hat\Pi_{\rm supp} . \tag{3.73} $$

Therefore, the $\hat\Pi_b$ must satisfy 

$$ \Bigl[\hat R_1^{-1/2}\Bigl(\hat\Pi_{\rm supp}\sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}}\;\hat\Pi_{\rm supp}\Bigr)\hat R_1^{-1/2}\Bigr]\hat\Pi_b = \frac{1}{\mu_b}\,\hat\Pi_b . \tag{3.74} $$

Hereafter we may run through the same steps as in the invertible case. Because the operator on the 
left-hand side of Eq. (3.74) is a positive operator, there are indeed m orthogonal projectors on the 
support of $\hat\rho_1$ that satisfy the conditions for optimizing the statistical overlap. Taking the $\hat E_b = \hat\Pi_b$ 
completes the proof. 

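A particular case of noninvertibility is when $\hat\rho_0 = |\psi_0\rangle\langle\psi_0|$ and $\hat\rho_1 = |\psi_1\rangle\langle\psi_1|$ are nonorthogonal 
pure states. Then Eq. (3.63) becomes 

$$ \begin{aligned}
\hat U_c|\psi_0\rangle\langle\psi_0|\psi_1\rangle\langle\psi_1| &= \sqrt{|\psi_1\rangle\langle\psi_1|\psi_0\rangle\langle\psi_0|\psi_1\rangle\langle\psi_1|} \\
&= |\langle\psi_0|\psi_1\rangle|\;|\psi_1\rangle\langle\psi_1|
\end{aligned} \tag{3.75} $$

and Eq. (3.56) becomes 

$$ |b\rangle\langle b|\psi_1\rangle\langle\psi_1| = \mu_b\,|b\rangle\langle b|\psi_0\rangle\langle\psi_0|\hat U_c^\dagger . \tag{3.76} $$

If we redefine the phases of $|\psi_0\rangle$ and $|\psi_1\rangle$ so that $\langle\psi_0|\psi_1\rangle$ is positive, we have that 

$$ \hat U_c|\psi_0\rangle = |\psi_1\rangle . \tag{3.77} $$

Therefore, Eq. (3.76) implies 

$$ \langle b|\psi_1\rangle = \mu_b\,\langle b|\psi_0\rangle . \tag{3.78} $$

This equation specifies that any orthonormal basis $|b\rangle$ containing vectors $|0\rangle$ and $|1\rangle$ lying in the 
plane spanned by the vectors $|\psi_0\rangle$ and $|\psi_1\rangle$ and straddling them will form an optimal measurement 
basis. This follows because in this case all inner products $\langle b|\psi_1\rangle$ and $\langle b|\psi_0\rangle$ will be nonnegative; thus 
Eq. (3.78) has a solution with nonnegative $\mu_b$. This supplements the set of optimal measurements 
found in Ref. |^] and is easily confirmed to be true as follows. Let $\theta$ be the angle between $|\psi_0\rangle$ and 
$|\psi_1\rangle$ and let $\phi$ be the angle between $|0\rangle$ and $|\psi_0\rangle$. Then for this measurement, 

$$ \begin{aligned}
F(\hat\rho_0,\hat\rho_1) &= \sqrt{\cos^2\phi\,\cos^2(\phi+\theta)} + \sqrt{\cos^2\bigl(\phi-\tfrac{\pi}{2}\bigr)\cos^2\bigl(\phi+\theta-\tfrac{\pi}{2}\bigr)} \\
&= \cos\phi\,\cos(\phi+\theta) + \sin\phi\,\sin(\phi+\theta) \\
&= \cos\theta ,
\end{aligned} \tag{3.79} $$

which is completely independent of $\phi$. This verifies the result. 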

Equation (3.77) raises the interesting question of, more generally, what is the action of $\hat U_c$? 
Could it be that $\hat U_c$ always takes a basis diagonalizing $\hat\rho_0$ to a basis diagonalizing $\hat\rho_1$? This appears 
not to be the case, unfortunately. The question of a more geometric interpretation of $\hat U_c$ is an open 
one. 

3.3.2 Properties 

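In this Subsection, we report a few interesting points about the measurement specified by $\hat M$ and 
the quantum distinguishability measure $F(\hat\rho_0,\hat\rho_1)$. The equation defining the statistical overlap is 
clearly invariant under interchanges of the labels 0 and 1. Therefore it must follow that 

$$ F(\hat\rho_0,\hat\rho_1) = F(\hat\rho_1,\hat\rho_0) . \tag{3.80} $$

A neat way to see this directly is to note that the operators $\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}$ and $\hat\rho_0^{1/2}\hat\rho_1\hat\rho_0^{1/2}$ have the 
same eigenvalue spectrum. For if $|b\rangle$ and $\lambda_b$ are an eigenvector and eigenvalue of $\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}$, it 
follows that 

$$ \begin{aligned}
\lambda_b\bigl(\hat\rho_0^{1/2}\hat\rho_1^{1/2}|b\rangle\bigr) &= \hat\rho_0^{1/2}\hat\rho_1^{1/2}\bigl(\lambda_b|b\rangle\bigr) \\
&= \hat\rho_0^{1/2}\hat\rho_1^{1/2}\bigl(\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}|b\rangle\bigr) \\
&= \bigl(\hat\rho_0^{1/2}\hat\rho_1\hat\rho_0^{1/2}\bigr)\bigl(\hat\rho_0^{1/2}\hat\rho_1^{1/2}|b\rangle\bigr) .
\end{aligned} \tag{3.81} $$

Hence, 

$$ \mathrm{tr}\sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}} = \sum_b \sqrt{\lambda_b} = \mathrm{tr}\sqrt{\hat\rho_0^{1/2}\hat\rho_1\hat\rho_0^{1/2}} , \tag{3.82} $$

and so $F(\hat\rho_0,\hat\rho_1) = F(\hat\rho_1,\hat\rho_0)$. 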
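By the same token, the derivation of Eq. (3.28) itself must remain valid if all the 0's and 1's 
in it are interchanged throughout. When $\hat\rho_0$ and $\hat\rho_1$ are invertible, however, this gives rise to a 
measurement specified by a basis diagonalizing 

$$ \hat N = \hat\rho_0^{-1/2}\sqrt{\hat\rho_0^{1/2}\hat\rho_1\hat\rho_0^{1/2}}\;\hat\rho_0^{-1/2} . \tag{3.83} $$

It turns out that $\hat M$ and $\hat N$ can define the same measurement because not only do they commute, 
they are inverses of each other. This can be seen as follows. Let $\hat A$ be any operator and $\hat V$ be a 
unitary operator such that 

$$ \hat V^\dagger\hat A = \sqrt{\hat A^\dagger\hat A} . \tag{3.84} $$

Hence $\hat A = \hat V\sqrt{\hat A^\dagger\hat A}$ and also $\hat A^\dagger = \sqrt{\hat A^\dagger\hat A}\,\hat V^\dagger$. So 

$$ \Bigl(\hat V\sqrt{\hat A^\dagger\hat A}\,\hat V^\dagger\Bigr)^2 = \Bigl(\hat V\sqrt{\hat A^\dagger\hat A}\Bigr)\Bigl(\sqrt{\hat A^\dagger\hat A}\,\hat V^\dagger\Bigr) = \hat A\hat A^\dagger , \tag{3.85} $$

and therefore, because $\hat V\sqrt{\hat A^\dagger\hat A}\,\hat V^\dagger$ is a nonnegative operator, 

$$ \sqrt{\hat A\hat A^\dagger} = \hat V\sqrt{\hat A^\dagger\hat A}\,\hat V^\dagger = \hat V\hat A^\dagger . \tag{3.86} $$

In particular, if 

$$ \sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}} = \hat U_c\,\hat\rho_0^{1/2}\hat\rho_1^{1/2} , \tag{3.87} $$

then 

$$ \sqrt{\hat\rho_0^{1/2}\hat\rho_1\hat\rho_0^{1/2}} = \hat U_c^\dagger\,\hat\rho_1^{1/2}\hat\rho_0^{1/2} . \tag{3.88} $$

Therefore, the desired property follows at once, 

$$ \hat M\hat N = \hat 1 . \tag{3.89} $$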

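We may also note an interesting expression for $\hat M$'s eigenvalues that arises from the last result. 
Let the eigenvalues and eigenvectors of $\hat M$ be denoted by $m_b$ and $|b\rangle$; in this notation $\hat E_b^F = |b\rangle\langle b|$. 
Then we can write two expressions for $m_b$: 

$$ \begin{aligned}
m_b\,\langle b|\hat\rho_1|b\rangle &= \langle b|\hat\rho_1\hat M|b\rangle \\
&= \langle b|\hat\rho_1^{1/2}\hat U_c\hat\rho_0^{1/2}|b\rangle ,
\end{aligned} \tag{3.90} $$

and 

$$ \begin{aligned}
\frac{1}{m_b}\,\langle b|\hat\rho_0|b\rangle &= \langle b|\hat\rho_0\hat N|b\rangle \\
&= \langle b|\hat\rho_0^{1/2}\hat U_c^\dagger\hat\rho_1^{1/2}|b\rangle \\
&= \bigl(\langle b|\hat\rho_1^{1/2}\hat U_c\hat\rho_0^{1/2}|b\rangle\bigr)^* .
\end{aligned} \tag{3.91} $$

Because the left hand sides of these equations are real numbers, so are the right hand sides; in 
particular, combining Eqs. (3.90) and (3.91), we get 

$$ m_b = \left(\frac{\langle b|\hat\rho_0|b\rangle}{\langle b|\hat\rho_1|b\rangle}\right)^{1/2} . \tag{3.92} $$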



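The optimal measurement operator $\hat M$ can be considered a sort of operator analog to the 
classical likelihood ratio, for its squared eigenvalues are the ratio of two probabilities. This fact 
gives rise to an interesting expression for the Kullback-Leibler relative information between $\hat\rho_0$ and 
$\hat\rho_1$ with respect to this measurement: 

$$ \begin{aligned}
K_F(\hat\rho_0/\hat\rho_1) &= \sum_b \bigl(\mathrm{tr}\,\hat\rho_0\hat E_b^F\bigr)\ln\!\left(\frac{\mathrm{tr}\,\hat\rho_0\hat E_b^F}{\mathrm{tr}\,\hat\rho_1\hat E_b^F}\right) \\
&= 2\,\mathrm{tr}\!\left(\hat\rho_0\,\ln\!\Bigl(\hat\rho_1^{-1/2}\sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}}\;\hat\rho_1^{-1/2}\Bigr)\right) .
\end{aligned} \tag{3.93} $$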
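
The relations of this subsection lend themselves to a direct numerical check. The sketch below is illustrative (NumPy assumed, well-conditioned random states, hypothetical helper names): it verifies that $\hat M$ and $\hat N$ are inverses, that the squared eigenvalues of $\hat M$ equal the ratios of outcome probabilities, and that the two forms of the Kullback-Leibler expression agree.

```python
import numpy as np

def psd_sqrt(a):
    """Square root of a positive semidefinite Hermitian matrix (via eigh)."""
    w, v = np.linalg.eigh((a + a.conj().T) / 2)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T

def random_density(d, rng, mix=0.2):
    """Random density matrix, mixed with the identity to keep it well conditioned."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    rho = rho / np.trace(rho).real
    return (1 - mix) * rho + mix * np.eye(d) / d

def sandwich_mean(rho_a, rho_b):
    """rho_b^{-1/2} sqrt(rho_b^{1/2} rho_a rho_b^{1/2}) rho_b^{-1/2}."""
    s = psd_sqrt(rho_b)
    s_inv = np.linalg.inv(s)
    x = s_inv @ psd_sqrt(s @ rho_a @ s) @ s_inv
    return (x + x.conj().T) / 2

rng = np.random.default_rng(5)
d = 3
rho0, rho1 = random_density(d, rng), random_density(d, rng)

m = sandwich_mean(rho0, rho1)   # the operator M
n = sandwich_mean(rho1, rho0)   # the operator N (labels 0 and 1 interchanged)

m_eigs, basis = np.linalg.eigh(m)
p0 = np.einsum('ib,ij,jb->b', basis.conj(), rho0, basis).real
p1 = np.einsum('ib,ij,jb->b', basis.conj(), rho1, basis).real

# Kullback-Leibler information of the outcome statistics, computed two ways.
kl_from_probs = float(np.sum(p0 * np.log(p0 / p1)))
log_m = (basis * np.log(m_eigs)) @ basis.conj().T
kl_from_operator = float(2 * np.trace(rho0 @ log_m).real)
```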



This, of course, will generally not be the maximum of the Kullback-Leibler information over all 
measurements, but it does provide a lower bound for the maximum value. Moreover, a quantity 
quite similar to this arises naturally in the context of still another measure of quantum distin- 
guishability studied by Braunstein |8^, In yet another guise, it appears in the work of Nagaoka 
@. 

There are two other representations for the quantum fidelity $F(\hat\rho_0,\hat\rho_1)$ that can be worked out 
simply with the techniques developed here. The first is [91] 

$$ F(\hat\rho_0,\hat\rho_1)^2 = \min_{\hat G}\;\mathrm{tr}(\hat\rho_0\hat G)\,\mathrm{tr}(\hat\rho_1\hat G^{-1}) , \tag{3.94} $$

when $\hat\rho_0$ and $\hat\rho_1$ are invertible, and where the minimum is taken over all invertible positive operators 
$\hat G$. This representation, in analogy to the representation of fidelity as the optimal statistical overlap, 
also comes about via the Schwarz inequality. Let us show this. 

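Let $\hat G$ be any invertible positive operator and let $\hat U$ be any unitary operator. Using the cyclic 
property of the trace and the Schwarz inequality, we have that 

$$ \begin{aligned}
\mathrm{tr}(\hat\rho_0\hat G)\,\mathrm{tr}(\hat\rho_1\hat G^{-1}) &= \mathrm{tr}\bigl((\hat U\hat\rho_0^{1/2}\hat G^{1/2})^\dagger(\hat U\hat\rho_0^{1/2}\hat G^{1/2})\bigr)\,\mathrm{tr}\bigl((\hat\rho_1^{1/2}\hat G^{-1/2})^\dagger(\hat\rho_1^{1/2}\hat G^{-1/2})\bigr) \\
&\ge \bigl|\mathrm{tr}\bigl((\hat U\hat\rho_0^{1/2}\hat G^{1/2})^\dagger(\hat\rho_1^{1/2}\hat G^{-1/2})\bigr)\bigr|^2 \\
&= \bigl|\mathrm{tr}\bigl(\hat U\hat\rho_0^{1/2}\hat\rho_1^{1/2}\bigr)\bigr|^2 .
\end{aligned} \tag{3.95} $$

Since $\hat U$ is arbitrary, we may use this freedom to make the inequality as tight as possible. To do 
this we choose 

$$ \hat U\hat\rho_0^{1/2}\hat\rho_1^{1/2} = \sqrt{\hat\rho_1^{1/2}\hat\rho_0\hat\rho_1^{1/2}} \tag{3.96} $$

to get that 

$$ \mathrm{tr}(\hat\rho_0\hat G)\,\mathrm{tr}(\hat\rho_1\hat G^{-1}) \ge F(\hat\rho_0,\hat\rho_1)^2 . \tag{3.97} $$

To find a particular $\hat G$ that achieves equality in this, we need merely study the condition for equality 
in the Schwarz inequality; in this case we must have 

$$ \hat U\hat\rho_0^{1/2}\hat G^{1/2} = \alpha\,\hat\rho_1^{1/2}\hat G^{-1/2} . \tag{3.98} $$

The solution $\hat G$ to this equation is unique and easy to find, 

$$ \hat G = \alpha\,\hat\rho_0^{-1/2}\sqrt{\hat\rho_0^{1/2}\hat\rho_1\hat\rho_0^{1/2}}\;\hat\rho_0^{-1/2} . \tag{3.99} $$

The constraint that $\hat G$ be a positive operator further restricts $\alpha$ to be a positive real number. Thus 
the optimal operator $\hat G$ is proportional to the operator $\hat N$ given by Eq. (3.83). The choice for $\hat G$ 
given by Eq. (3.99) demonstrates that Eq. (3.94) does in fact hold. This is easy to verify by noting 
the fact that $\hat N^{-1} = \hat M$. 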
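
The variational characterization in terms of $\hat G$ can be probed numerically. The sketch below is an illustration under stated assumptions (NumPy, invertible random states, hypothetical helper names): it evaluates the product $\mathrm{tr}(\hat\rho_0\hat G)\,\mathrm{tr}(\hat\rho_1\hat G^{-1})$ at $\hat G = \hat N$, compares it with the squared fidelity, and samples random positive operators $\hat G$ to see that none gives a smaller value.

```python
import numpy as np

def psd_sqrt(a):
    """Square root of a positive semidefinite Hermitian matrix (via eigh)."""
    w, v = np.linalg.eigh((a + a.conj().T) / 2)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T

def fidelity(rho0, rho1):
    s = psd_sqrt(rho1)
    return float(np.trace(psd_sqrt(s @ rho0 @ s)).real)

def random_positive(d, rng, mix=0.2):
    """Random positive definite matrix with unit trace, kept well conditioned."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    x = g @ g.conj().T
    x = x / np.trace(x).real
    return (1 - mix) * x + mix * np.eye(d) / d

rng = np.random.default_rng(6)
d = 3
rho0, rho1 = random_positive(d, rng), random_positive(d, rng)

def objective(g):
    """tr(rho0 G) tr(rho1 G^{-1}) for an invertible positive operator G."""
    return float(np.trace(rho0 @ g).real * np.trace(rho1 @ np.linalg.inv(g)).real)

# Candidate optimum: N = rho0^{-1/2} sqrt(rho0^{1/2} rho1 rho0^{1/2}) rho0^{-1/2}.
r0 = psd_sqrt(rho0)
r0_inv = np.linalg.inv(r0)
n = r0_inv @ psd_sqrt(r0 @ rho1 @ r0) @ r0_inv

value_at_n = objective(n)
value_at_scaled_n = objective(3.7 * n)   # the objective ignores the overall scale of G
sampled_values = [objective(random_positive(d, rng)) for _ in range(300)]
f_squared = fidelity(rho0, rho1) ** 2
```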

The second representation is more literally concerned with the Bures distance defined by 
Eq. (|3!29|) . It is § 

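$$ d_B^2(\hat\rho_0,\hat\rho_1) = \min\,\mathrm{tr}\bigl((\hat W_0 - \hat W_1)(\hat W_0 - \hat W_1)^\dagger\bigr) , \tag{3.100} $$

where the minimum is taken over all operators $\hat W_0$ and $\hat W_1$ such that 

$$ \hat W_0\hat W_0^\dagger = \hat\rho_0 \quad\text{and}\quad \hat W_1\hat W_1^\dagger = \hat\rho_1 . \tag{3.101} $$

This is seen easily by noting that Eq. (3.101) requires, by the operator polar decomposition theorem, 
that 

$$ \hat W_0 = \hat\rho_0^{1/2}\hat U_0 \quad\text{and}\quad \hat W_1 = \hat\rho_1^{1/2}\hat U_1 \tag{3.102} $$

for some unitary operators $\hat U_0$ and $\hat U_1$. Then the right hand side of Eq. (3.100) becomes 

$$ \begin{aligned}
\min\,\mathrm{tr}\bigl((\hat W_0 - \hat W_1)(\hat W_0 - \hat W_1)^\dagger\bigr) &= \min_{\hat U_0,\hat U_1}\mathrm{tr}\bigl(\hat W_0\hat W_0^\dagger - \hat W_0\hat W_1^\dagger - \hat W_1\hat W_0^\dagger + \hat W_1\hat W_1^\dagger\bigr) \\
&= \mathrm{tr}(\hat\rho_0) + \mathrm{tr}(\hat\rho_1) - 2\max_{\hat U_0,\hat U_1}\mathrm{Re}\bigl[\mathrm{tr}(\hat W_0\hat W_1^\dagger)\bigr] \\
&= 2 - 2\max_{\hat V}\,\mathrm{Re}\bigl[\mathrm{tr}\bigl(\hat V\hat\rho_1^{1/2}\hat\rho_0^{1/2}\bigr)\bigr] \\
&= 2 - 2F(\hat\rho_0,\hat\rho_1) .
\end{aligned} \tag{3.103} $$

The last step in this follows from Eq. (3.60) and demonstrates the truth of Eq. (3.100). 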

Finally, we should mention something about the measurement specified by $\hat M$, as it might 
appear in the larger context. A very general theory of operator means has been developed by Kubo 
and Ando [92, 93] in which the notion of a geometric mean [94] plays a special role. The geometric 
mean between two positive operators $\hat A$ and $\hat B$ is defined by 

$$ \hat A\,\#\,\hat B = \hat A^{1/2}\sqrt{\hat A^{-1/2}\hat B\hat A^{-1/2}}\;\hat A^{1/2} . \tag{3.104} $$

More generally an operator mean is any mapping $(\hat A,\hat B) \mapsto \hat A\,\sigma\,\hat B$ from operators on a D-dimensional 
Hilbert space to a D'-dimensional space such that 

1. $\alpha(\hat A\,\sigma\,\hat B) = (\alpha\hat A)\,\sigma\,(\alpha\hat B)$ for any nonnegative constant $\alpha$, 

2. $\hat A\,\sigma\,\hat A = \hat A$, 

3. $\hat A\,\sigma\,\hat B \ge \hat A'\,\sigma\,\hat B'$ whenever $\hat A \ge \hat A'$ and $\hat B \ge \hat B'$, 

4. a certain continuity condition is satisfied, and 

5. $(\hat T^\dagger\hat A\hat T)\,\sigma\,(\hat T^\dagger\hat B\hat T) \ge \hat T^\dagger(\hat A\,\sigma\,\hat B)\hat T$ for every operator $\hat T$. 

Another characterization of the geometric mean is that (in any representation) it is the matrix 
X such that the matrix (on a D^-dimensional space) defined by 

is maximized in the matrix sense. 

In this notation, the optimal measurement operator $\hat M$ for statistical overlap is 

$$ \hat M = \hat\rho_1^{-1}\,\#\,\hat\rho_0 . \tag{3.106} $$
The significance of this correspondence, however, is yet to be determined. 



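3.3.3 The Two-Dimensional Case 

In this Subsection, we derive a useful expression for the measurement $\{\hat E_b^F\}$ in a case of particular 
interest, two-dimensional Hilbert spaces. Here the best strategy for finding the basis projectors 
$\hat E_b^F$ is not to directly diagonalize the operator $\hat M$, but rather to focus on variational methods. The 
great simplification for two-dimensional Hilbert spaces is that the signal states $\hat\rho_0$ and $\hat\rho_1$ may be 
represented as vectors within the unit ball of $\mathbb{R}^3$, the so-called Bloch sphere: 

$$ \hat\rho_0 = \tfrac{1}{2}\bigl(\hat 1 + \vec a\cdot\hat{\vec\sigma}\bigr) \quad\text{and}\quad \hat\rho_1 = \tfrac{1}{2}\bigl(\hat 1 + \vec b\cdot\hat{\vec\sigma}\bigr) , \tag{3.107} $$

where $a = |\vec a| \le 1$, $b = |\vec b| \le 1$, and $\hat{\vec\sigma}$ is the Pauli spin vector. This follows because the identity 
operator $\hat 1$ and the (trace-free) Pauli operators $\hat{\vec\sigma} = (\hat\sigma_x,\hat\sigma_y,\hat\sigma_z)$, i.e., 

$$ \hat\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} , \quad \hat\sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix} , \quad \hat\sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} , \tag{3.108} $$

form a basis for the vector space of 2 x 2 Hermitian operators. In this representation the signal 
states are pure if $\vec a$ and $\vec b$ have unit modulus. More generally, the eigenvalues of $\hat\rho_0$ are given by 
$\tfrac{1}{2}(1 - a)$ and $\tfrac{1}{2}(1 + a)$ and similarly for $\hat\rho_1$. Consider an orthogonal projection-valued measurement 
corresponding to the unit Bloch vector $\vec n/n$ and its antipode $-\vec n/n$; for this measurement, the 
possible outcomes can be labeled simply +1 and -1. It is a trivial matter, using the identity 

$$ (\vec a\cdot\hat{\vec\sigma})(\vec n\cdot\hat{\vec\sigma}) = (\vec a\cdot\vec n)\,\hat 1 + i\,\hat{\vec\sigma}\cdot(\vec a\times\vec n) , \tag{3.109} $$

to show that 

$$ p_0(b) = \frac{1}{2}\left(1 \pm \vec a\cdot\frac{\vec n}{n}\right) \quad\text{and}\quad p_1(b) = \frac{1}{2}\left(1 \pm \vec b\cdot\frac{\vec n}{n}\right) . \tag{3.110} $$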
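The statistical overlap, i.e., $F(p_0,p_1)$, with respect to this measurement can thus be written as 

$$ W(\vec n) \equiv \frac{1}{2}\sqrt{\left(1 + \vec a\cdot\frac{\vec n}{n}\right)\left(1 + \vec b\cdot\frac{\vec n}{n}\right)} + \frac{1}{2}\sqrt{\left(1 - \vec a\cdot\frac{\vec n}{n}\right)\left(1 - \vec b\cdot\frac{\vec n}{n}\right)} . \tag{3.111} $$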

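The optimal projector for expression (3.111) can be found by varying it with respect to all vectors 
$\vec n$, i.e., by setting $\delta W(\vec n) = 0$. Using 

$$ \delta n = \delta\sqrt{\vec n\cdot\vec n} = \frac{1}{2n}\,\delta(\vec n\cdot\vec n) = \frac{1}{2n}\bigl((\delta\vec n)\cdot\vec n + \vec n\cdot(\delta\vec n)\bigr) = \frac{\vec n}{n}\cdot(\delta\vec n) , \tag{3.112} $$

one finds after a bit of algebra that the optimal $\vec n$ must lie in the plane spanned by $\vec a$ and $\vec b$ and 
satisfy the equation 

(3.113) 

A vector $\vec n_B$ satisfying these two requirements is 

$$ \vec n_B = \frac{\vec a}{\sqrt{1 - a^2}} - \frac{\vec b}{\sqrt{1 - b^2}} . \tag{3.114} $$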



It might be noted that the variational equation $\delta W(\vec n) = 0$ also generates the measurement with 
respect to which $\hat\rho_0$ and $\hat\rho_1$ are the least distinguishable; this is the measurement specified by the 
Bloch vector $\vec n_0$ orthogonal to the vector $\vec d = \vec b - \vec a$. 

If $\vec n_B$ is plugged back into Eq. (3.111), one finds after quite some algebra that 

$$ F(\hat\rho_0,\hat\rho_1) = \frac{1}{\sqrt 2}\Bigl(1 + \vec a\cdot\vec b + \sqrt{1 - a^2}\,\sqrt{1 - b^2}\Bigr)^{1/2} . \tag{3.115} $$

(This expression does not match Hübner's expression in Ref. [81] because of his convention that 
the Bloch sphere is of radius 1/2.) 
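
The two-dimensional formulas can be compared against a direct matrix computation. The sketch below is illustrative only (NumPy assumed; the Bloch vectors and helper names are arbitrary choices). It takes the overlap for a projective measurement along a unit Bloch vector to be $\tfrac12\sqrt{(1+\vec a\cdot\hat n)(1+\vec b\cdot\hat n)} + \tfrac12\sqrt{(1-\vec a\cdot\hat n)(1-\vec b\cdot\hat n)}$, and it assumes the optimal direction is $\vec a/\sqrt{1-a^2} - \vec b/\sqrt{1-b^2}$; the relative minus sign is the choice that this numerical check supports, and it should be treated as a reconstruction rather than a quotation.

```python
import numpy as np

PAULI = np.array([[[0, 1], [1, 0]],
                  [[0, -1j], [1j, 0]],
                  [[1, 0], [0, -1]]], dtype=complex)

def psd_sqrt(a):
    """Square root of a positive semidefinite Hermitian matrix (via eigh)."""
    w, v = np.linalg.eigh((a + a.conj().T) / 2)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T

def fidelity(rho0, rho1):
    s = psd_sqrt(rho1)
    return float(np.trace(psd_sqrt(s @ rho0 @ s)).real)

def bloch_state(v):
    """Density matrix (1 + v . sigma) / 2 for a Bloch vector v with |v| <= 1."""
    return 0.5 * (np.eye(2) + np.einsum('k,kij->ij', v, PAULI))

def overlap_along(a, b, n):
    """Statistical overlap for the projective measurement along direction n."""
    n_hat = n / np.linalg.norm(n)
    alpha, beta = a @ n_hat, b @ n_hat
    return 0.5 * (np.sqrt((1 + alpha) * (1 + beta)) + np.sqrt((1 - alpha) * (1 - beta)))

def fidelity_bloch(a, b):
    """Closed form in terms of Bloch vectors (unit-radius convention)."""
    return np.sqrt(0.5 * (1 + a @ b + np.sqrt(1 - a @ a) * np.sqrt(1 - b @ b)))

a = np.array([0.6, 0.0, 0.0])
b = np.array([0.0, 0.8, 0.0])

n_best = a / np.sqrt(1 - a @ a) - b / np.sqrt(1 - b @ b)
n_least = np.cross(b - a, np.cross(a, b))   # in the a-b plane, orthogonal to d = b - a

w_best = overlap_along(a, b, n_best)
w_least = overlap_along(a, b, n_least)
f_closed_form = fidelity_bloch(a, b)
f_matrix = fidelity(bloch_state(a), bloch_state(b))

rng = np.random.default_rng(7)
w_sampled = [overlap_along(a, b, rng.normal(size=3)) for _ in range(2000)]
```

Along the direction orthogonal to $\vec b - \vec a$ the two outcome distributions coincide, so the overlap there is exactly 1, its largest possible value.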



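3.4 The Quantum Rényi Overlaps 

Recall that the quantum Rényi overlap of order $\alpha$ (for $0 < \alpha < 1$) is defined by 

$$ F_\alpha(\hat\rho_0/\hat\rho_1) = \min_{\{\hat E_b\}} \sum_b \bigl(\mathrm{tr}(\hat\rho_0\hat E_b)\bigr)^{\alpha}\bigl(\mathrm{tr}(\hat\rho_1\hat E_b)\bigr)^{1-\alpha} . \tag{3.116} $$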

This measure of quantum distinguishability, though it has no compelling operational meaning, 
proves to be a convenient testing ground for optimization techniques more elaborate than so far 
encountered. Equation (3.116) is more unwieldy than the quantum statistical overlap, but not 
so transcendental in character as to contain a logarithm like the Kullback-Leibler and mutual 
informations. One can still maintain a serious hope that an explicit expression for it is possible. 

Moreover, if an explicit expression can be found for Eq. (3.116), then one would have a direct 
route to the quantum Kullback information through either the Rényi relative information of order 
$\alpha$ [|36|, g7| 

$$ K_\alpha(p_0/p_1) = \frac{1}{\alpha - 1}\,\ln\!\left(\sum_b p_0(b)^{\alpha}\,p_1(b)^{1-\alpha}\right) , \tag{3.117} $$
or the relative information of type $\alpha$ of Rathie and Kannappan [p5| , |6^ 

(3.118) 

Both these quantities converge to the Kullback-Leibler relative information in the limit $\alpha \to 1$, as 
can be seen easily by using l'Hospital's rule. Therefore, one just needs to find $F_\alpha(\hat\rho_0/\hat\rho_1)$ and take 
the limit $\alpha \to 1$ to get the quantum Kullback information. 

The brightest ray of hope for finding an explicit expression for $F_\alpha(\hat\rho_0/\hat\rho_1)$ is that the quantum 
fidelity was found via an application of the Schwarz inequality. There is a closely related inequality 
in the mathematical literature known as the Hölder inequality [96, 97] that, at first sight at least, 
would appear to be of use in the same way for this context. This inequality is that, for any two 
sequences $a_k$ and $b_k$ of complex numbers, $k = 1, \ldots, n$, 

$$ \sum_{k=1}^{n} |a_k b_k| \le \left(\sum_{k=1}^{n} |a_k|^p\right)^{1/p}\left(\sum_{k=1}^{n} |b_k|^q\right)^{1/q} \tag{3.119} $$

when $p, q > 1$ with 

$$ \frac{1}{p} + \frac{1}{q} = 1 . \tag{3.120} $$

Equality is achieved in Eq. (3.119) if and only if there exists a constant c such that 

$$ |b_k| = c\,|a_k|^{p-1} . \tag{3.121} $$

The standard Schwarz inequality is the special case of this for $p = 2$. 

One would like to use some appropriate operator analog to the Hölder inequality in much the 
same way as the operator-Schwarz inequality was used for optimizing the statistical overlap: use it 
to bound the quantum Rényi overlap and then search out the conditions for achieving equality, 
perhaps again by taking advantage of the invariances of the trace. In particular, one would like to 
find something of the form 

$$ \bigl(\mathrm{tr}(\hat\rho_0\hat E_b)\bigr)^{\alpha}\bigl(\mathrm{tr}(\hat\rho_1\hat E_b)\bigr)^{1-\alpha} \ge \mathrm{tr}\bigl(f(\hat\rho_0,\hat\rho_1;\alpha)\,\hat E_b\bigr) , \tag{3.122} $$

with a function $f(\hat\rho_0,\hat\rho_1;\alpha)$ that is independent of the POVM $\{\hat E_b\}$. In this way, the linearity of 
the trace and the completeness of the $\hat E_b$ could be used in the same fashion as before. 

Unfortunately — even after an exhaustive literature search — an inequality sufficiently strong to 
carry through the optimization has yet to be found. Nevertheless, for future endeavors, we report 
the most promising lines of attack found so far. 

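For the first demonstration, we need to list the standard operator-Hölder inequality [99, 100] 

$$ \bigl|\mathrm{tr}(\hat A\hat B)\bigr| \le \Bigl(\mathrm{tr}\bigl((\hat A^\dagger\hat A)^{p/2}\bigr)\Bigr)^{1/p}\Bigl(\mathrm{tr}\bigl((\hat B^\dagger\hat B)^{q/2}\bigr)\Bigr)^{1/q} \tag{3.123} $$

and the Araki inequality [101, 102], 

$$ \mathrm{tr}\Bigl(\bigl(\hat C^{1/2}\hat D\hat C^{1/2}\bigr)^{r}\Bigr) \le \mathrm{tr}\bigl(\hat C^{r/2}\hat D^{r}\hat C^{r/2}\bigr) , \tag{3.124} $$

for $\hat C$ and $\hat D$ positive operators and $r > 1$. These two inequalities can get us to something "close" 
to Eq. (3.122), though not quite linear in the $\hat E_b$. Let us show how this is done. Suppose positive 
p and q satisfy Eq. (3.120). Then 

$$ \begin{aligned}
\bigl(\mathrm{tr}(\hat\rho_0\hat E_b)\bigr)^{1/p}\bigl(\mathrm{tr}(\hat\rho_1\hat E_b)\bigr)^{1/q} &= \Bigl(\mathrm{tr}\bigl(\hat E_b^{1/2}\hat\rho_0\hat E_b^{1/2}\bigr)\Bigr)^{1/p}\Bigl(\mathrm{tr}\bigl(\hat E_b^{1/2}\hat\rho_1\hat E_b^{1/2}\bigr)\Bigr)^{1/q} \\
&\ge \Bigl(\mathrm{tr}\bigl(\bigl(\hat E_b^{1/2p}\hat\rho_0^{1/p}\hat E_b^{1/2p}\bigr)^{p}\bigr)\Bigr)^{1/p}\Bigl(\mathrm{tr}\bigl(\bigl(\hat E_b^{1/2q}\hat\rho_1^{1/q}\hat E_b^{1/2q}\bigr)^{q}\bigr)\Bigr)^{1/q} \\
&\ge \mathrm{tr}\bigl(\hat E_b^{1/2p}\hat\rho_0^{1/p}\hat E_b^{1/2p}\,\hat E_b^{1/2q}\hat\rho_1^{1/q}\hat E_b^{1/2q}\bigr) .
\end{aligned} \tag{3.125} $$

The upshot is a bound that is "almost" linear in the $\hat E_b$. There are other variations on this theme, 
but they are lacking in the same respect. 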

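A second way to tackle this problem via the Hölder inequality is at the eigenvalue level. To set 
this problem up, let us use the notation $\lambda_i(\hat A)$ to denote the eigenvalues of any Hermitian operator 
$\hat A$, when numbered so that they form a nonincreasing sequence, i.e., 

$$ \lambda_1(\hat A) \ge \lambda_2(\hat A) \ge \cdots \ge \lambda_D(\hat A) . \tag{3.126} $$

With this, we may write down a theorem, originally due to Richter [105, 104], that places a bound 
on the trace of a product of two operators 

$$ \sum_{i=1}^{D} \lambda_i(\hat A)\,\lambda_{D-i+1}(\hat B) \le \mathrm{tr}(\hat A\hat B) \le \sum_{i=1}^{D} \lambda_i(\hat A)\,\lambda_i(\hat B) . \tag{3.127} $$

Using this and the Hölder inequality for numbers, one immediately obtains 

$$ \begin{aligned}
\bigl(\mathrm{tr}(\hat\rho_0\hat E_b)\bigr)^{1/p}\bigl(\mathrm{tr}(\hat\rho_1\hat E_b)\bigr)^{1/q} &\ge \left(\sum_{i=1}^{D}\lambda_i(\hat\rho_0)\,\lambda_{D-i+1}(\hat E_b)\right)^{1/p}\left(\sum_{i=1}^{D}\lambda_i(\hat\rho_1)\,\lambda_{D-i+1}(\hat E_b)\right)^{1/q} \\
&\ge \sum_{i=1}^{D}\lambda_i(\hat\rho_0)^{1/p}\,\lambda_i(\hat\rho_1)^{1/q}\,\lambda_{D-i+1}(\hat E_b) .
\end{aligned} \tag{3.128} $$

This is, in a certain sense, linear in the operator $\hat E_b$. Regardless of this, however, the bound given 
by Eq. (3.128) is far too loose. For instance, if the $\hat E_b$ are one-dimensional projectors, this bound 
reduces to the product $\lambda_D(\hat\rho_0)^{1/p}\,\lambda_D(\hat\rho_1)^{1/q}$ and so gives rise to 

$$ \sum_{b=1}^{D}\bigl(\mathrm{tr}(\hat\rho_0\hat E_b)\bigr)^{1/p}\bigl(\mathrm{tr}(\hat\rho_1\hat E_b)\bigr)^{1/q} \ge D\,\lambda_D(\hat\rho_0)^{1/p}\,\lambda_D(\hat\rho_1)^{1/q} . \tag{3.129} $$

This can hardly be a tight bound, disregarding as it does all the other structure of the operators 
$\hat\rho_0$ and $\hat\rho_1$ and their relation to each other. 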

Could it be that an optimal POVM for the Rényi overlap can always be taken to be an orthogonal 
projection-valued measurement? Chances of this are strong, given that that was the case for the 
statistical overlap. In case of this, let us point out the following. Suppose $\hat A$ is an invertible positive 
operator with spectral decomposition $\hat A = \sum_b a_b\hat\Pi_b$. Then applying the Hölder inequality again, 
we have 

$$ \begin{aligned}
\bigl(\mathrm{tr}(\hat\rho_0\hat A^{p})\bigr)^{1/p}\bigl(\mathrm{tr}(\hat\rho_1\hat A^{-q})\bigr)^{1/q} &= \left(\sum_b a_b^{p}\,\mathrm{tr}(\hat\rho_0\hat\Pi_b)\right)^{1/p}\left(\sum_b a_b^{-q}\,\mathrm{tr}(\hat\rho_1\hat\Pi_b)\right)^{1/q} \\
&\ge \sum_b \bigl(\mathrm{tr}(\hat\rho_0\hat\Pi_b)\bigr)^{1/p}\bigl(\mathrm{tr}(\hat\rho_1\hat\Pi_b)\bigr)^{1/q} .
\end{aligned} \tag{3.130} $$

Therefore, as it was for the case of the quantum statistical overlap, it may well be that 

$$ \min_{\hat A > 0}\;\bigl(\mathrm{tr}(\hat\rho_0\hat A^{p})\bigr)^{1/p}\bigl(\mathrm{tr}(\hat\rho_1\hat A^{-q})\bigr)^{1/q} \tag{3.131} $$

actually equals the quantum Rényi overlap of order $1/p$. 



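Let us point out a bound for this quantity. Note that, for any positive invertible operator $\hat A$, 

$$ \lambda_{D-i+1}(\hat A^{-q}) = \bigl(\lambda_i(\hat A)\bigr)^{-q} . \tag{3.132} $$

Using the Richter and Hölder inequalities, we have that 

$$ \begin{aligned}
\bigl(\mathrm{tr}(\hat\rho_0\hat A^{p})\bigr)^{1/p}\bigl(\mathrm{tr}(\hat\rho_1\hat A^{-q})\bigr)^{1/q} &\ge \left(\sum_{i=1}^{D}\lambda_{D-i+1}(\hat\rho_0)\,\lambda_i(\hat A^{p})\right)^{1/p}\left(\sum_{i=1}^{D}\lambda_i(\hat\rho_1)\,\lambda_{D-i+1}(\hat A^{-q})\right)^{1/q} \\
&= \left(\sum_{i=1}^{D}\lambda_{D-i+1}(\hat\rho_0)\,\lambda_i(\hat A)^{p}\right)^{1/p}\left(\sum_{i=1}^{D}\lambda_i(\hat\rho_1)\,\lambda_i(\hat A)^{-q}\right)^{1/q} \\
&\ge \sum_{i=1}^{D}\lambda_{D-i+1}(\hat\rho_0)^{1/p}\,\lambda_i(\hat\rho_1)^{1/q} .
\end{aligned} \tag{3.133} $$

This gives a nice lower bound, though it is most certainly not tight, as can be seen from the fact 
that it does not reproduce the quantum fidelity when $p = 2$. 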

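Let us now mention another method for lower bounding the quantum Rényi overlap that does 
not appear to be related to a Hölder inequality at all. This one relies on an inequality of Ando [93] 
concerning the operator means introduced in Section 3.3.2. For any operator mean $\sigma$, 

$$ \bigl(\mathrm{tr}(\hat C\hat A)\bigr)\,\sigma\,\bigl(\mathrm{tr}(\hat C\hat B)\bigr) \ge \mathrm{tr}\bigl(\hat C\,(\hat A\,\sigma\,\hat B)\bigr) , \tag{3.134} $$

where $\hat A$, $\hat B$, and $\hat C$ are all positive operators. We can use this in the following way. Note that the 
mapping $\#_\alpha$ defined by 

$$ \hat A\,\#_\alpha\,\hat B = \hat A^{1/2}\bigl(\hat A^{-1/2}\hat B\hat A^{-1/2}\bigr)^{\alpha}\hat A^{1/2} \tag{3.135} $$

satisfies all the properties of an operator mean ||9^. When $\hat A$ and $\hat B$ are scalars, and so commute, 
this operator mean reduces to 

$$ A\,\#_\alpha\,B = B^{\alpha}A^{1-\alpha} . \tag{3.136} $$

Therefore, for any POVM $\{\hat E_b\}$, we can write the inequality 

$$ \begin{aligned}
\sum_b \bigl(\mathrm{tr}(\hat\rho_0\hat E_b)\bigr)^{\alpha}\bigl(\mathrm{tr}(\hat\rho_1\hat E_b)\bigr)^{1-\alpha} &\ge \sum_b \mathrm{tr}\bigl(\hat E_b\,(\hat\rho_1\,\#_\alpha\,\hat\rho_0)\bigr) \\
&= \mathrm{tr}\Bigl(\hat\rho_1^{1/2}\bigl(\hat\rho_1^{-1/2}\hat\rho_0\hat\rho_1^{-1/2}\bigr)^{\alpha}\hat\rho_1^{1/2}\Bigr) .
\end{aligned} \tag{3.137} $$

Therefore one obtains a lower bound on the quantum Rényi overlap of order $\alpha$. Again the bound 
is not as tight as it might be because it does not reproduce the result known for $\alpha = 1/2$. 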
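
The operator-mean bound can be examined numerically for projective measurements. The sketch below is illustrative (NumPy assumed, invertible random states, hypothetical function names): it evaluates $\mathrm{tr}(\hat\rho_1\,\#_\alpha\,\hat\rho_0)$ with $\hat A\,\#_\alpha\,\hat B = \hat A^{1/2}(\hat A^{-1/2}\hat B\hat A^{-1/2})^{\alpha}\hat A^{1/2}$, compares it with Rényi overlaps computed in randomly chosen orthonormal bases, and checks that at $\alpha = 1/2$ it does not exceed the fidelity.

```python
import numpy as np

def herm_power(a, t):
    """Power a^t of a positive semidefinite Hermitian matrix (via eigh)."""
    w, v = np.linalg.eigh((a + a.conj().T) / 2)
    return (v * np.clip(w, 0.0, None) ** t) @ v.conj().T

def fidelity(rho0, rho1):
    s = herm_power(rho1, 0.5)
    return float(np.trace(herm_power(s @ rho0 @ s, 0.5)).real)

def weighted_mean(a, b, alpha):
    """A #_alpha B = A^{1/2} (A^{-1/2} B A^{-1/2})^alpha A^{1/2}."""
    r = herm_power(a, 0.5)
    r_inv = np.linalg.inv(r)
    return r @ herm_power(r_inv @ b @ r_inv, alpha) @ r

def random_density(d, rng, mix=0.2):
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    rho = rho / np.trace(rho).real
    return (1 - mix) * rho + mix * np.eye(d) / d

def random_unitary(d, rng):
    q, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q

def renyi_overlap(rho0, rho1, basis, alpha):
    """Sum_b p0(b)^alpha p1(b)^(1-alpha) for the projective measurement in basis."""
    p0 = np.einsum('ib,ij,jb->b', basis.conj(), rho0, basis).real
    p1 = np.einsum('ib,ij,jb->b', basis.conj(), rho1, basis).real
    return float(np.sum(p0 ** alpha * p1 ** (1 - alpha)))

rng = np.random.default_rng(8)
d, alpha = 3, 0.3
rho0, rho1 = random_density(d, rng), random_density(d, rng)

bound = float(np.trace(weighted_mean(rho1, rho0, alpha)).real)
sampled = [renyi_overlap(rho0, rho1, random_unitary(d, rng), alpha) for _ in range(300)]

# At alpha = 1/2 the bound should not exceed the fidelity.
half_bound = float(np.trace(weighted_mean(rho1, rho0, 0.5)).real)
f = fidelity(rho0, rho1)

# Scalar sanity check of the reduction A #_alpha B = B^alpha A^(1 - alpha).
scalar_mean = weighted_mean(np.array([[2.0]]), np.array([[8.0]]), 0.5)[0, 0]
```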



Finally, we should point out that Hasegawa [105, 106] has studied the quantity 



4 



Ha{po/pi) = Y^^'[y-Po" Pi " )hj (3.138) 

in the context of distinguishing quantum states. The connection between this and Eq. (3.116) (if 
there is any) is not known. 

3.5 The Accessible Information 

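A binary quantum communication channel is defined by its signal states $\{\hat\rho_0,\hat\rho_1\}$ and their prior 
probabilities 

$$ \pi_0 = 1 - t \quad\text{and}\quad \pi_1 = t , \tag{3.139} $$

for $0 \le t \le 1$. Consider a measurement $\{\hat E_b\}$ on the channel. The probability of an outcome b 
when the message state is k (k = 0, 1) is 

$$ p_k(b) = \mathrm{tr}(\hat\rho_k\hat E_b) . \tag{3.140} $$

The unconditioned probability distribution for the outcomes is 

$$ p(b) = \mathrm{tr}(\hat\rho\hat E_b) , \tag{3.141} $$

for 

$$ \begin{aligned}
\hat\rho &= (1 - t)\hat\rho_0 + t\hat\rho_1 \\
&= \hat\rho_0 + t\hat\Delta \\
&= \hat\rho_1 - (1 - t)\hat\Delta ,
\end{aligned} \tag{3.142} $$

where the difference operator $\hat\Delta$ is defined by 

$$ \hat\Delta = \hat\rho_1 - \hat\rho_0 . \tag{3.143} $$

The Shannon mutual information |1C, 60| for the channel, with respect to the measurement $\{\hat E_b\}$, 
is defined to be 

$$ \begin{aligned}
J(p_0,p_1;t) &= H(p) - (1 - t)H(p_0) - tH(p_1) \\
&= (1 - t)K(p_0/p) + tK(p_1/p) ,
\end{aligned} \tag{3.144} $$

where 

$$ H(q) = -\sum_b q(b)\ln q(b) \tag{3.145} $$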
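is the Shannon information pO| , 66, ^ of the probability distribution q(b). Because a natural 
logarithm has been used in this definition, information here is quantified in terms of "nats" rather 
than the more commonly used unit of "bits." The accessible information [107, 108] $I(\hat\rho_0|\hat\rho_1)$ is the 
mutual information maximized over all measurements $\{\hat E_b\}$: 

$$ \begin{aligned}
I(\hat\rho_0|\hat\rho_1) &= \max_{\{\hat E_b\}} \sum_b \Bigl(-\mathrm{tr}(\hat\rho\hat E_b)\ln\bigl(\mathrm{tr}(\hat\rho\hat E_b)\bigr) \\
&\qquad\qquad + (1 - t)\,\mathrm{tr}(\hat\rho_0\hat E_b)\ln\bigl(\mathrm{tr}(\hat\rho_0\hat E_b)\bigr) + t\,\mathrm{tr}(\hat\rho_1\hat E_b)\ln\bigl(\mathrm{tr}(\hat\rho_1\hat E_b)\bigr)\Bigr) \\
&= \max_{\{\hat E_b\}} \sum_b \left(\pi_0\,\mathrm{tr}(\hat\rho_0\hat E_b)\ln\!\left(\frac{\mathrm{tr}(\hat\rho_0\hat E_b)}{\mathrm{tr}(\hat\rho\hat E_b)}\right) + \pi_1\,\mathrm{tr}(\hat\rho_1\hat E_b)\ln\!\left(\frac{\mathrm{tr}(\hat\rho_1\hat E_b)}{\mathrm{tr}(\hat\rho\hat E_b)}\right)\right) .
\end{aligned} \tag{3.146} $$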

In this Section we will often use the alternative notations J{t) and I{t) for the mutual and accessible 
informations; this notation is more compact while still making explicit the dependence of these 
quantities on the prior probabilities. 

The significance of expression (3.144), explained in great detail in Chapter 2, can be summarized
in the following way. Imagine for a moment that the receiver actually does know which message
was sent, but nevertheless performs the measurement $\{\hat E_b\}$ on it. Regardless of this knowledge, the
receiver will not be able to predict the exact outcome of his measurement; this is just because of
quantum indeterminism. The most he can say is that outcome $b$ will occur with probability $p_k(b)$.
A different way to summarize this is that even with the exact message known, the receiver will
gain information via the unpredictable outcome $b$. That information, however, has nothing to do
with the message itself. The residual information gain is a signature of quantum indeterminism,
now quantified by the Shannon information $H(p_k)$ of the outcome distribution. Now return to the
real scenario, where the receiver actually does not know which message was sent. The amount
of residual information the receiver can expect to gain in this case is $(1-t)H(p_0) + tH(p_1)$. This
quantity, however, is not identical to the information the receiver can expect to gain in toto. That is
because the receiver must describe the quantum system encoding the message by the mean density
operator $\hat\rho$; this reflects his lack of knowledge about the preparation. For this state, the expected
amount of information gain in a measurement of $\{\hat E_b\}$ is $H(p)$. This time some of the information
will have to do with the message itself, rather than being due to quantum indeterminism. The
natural quantity for describing the information gained exclusively about the message itself (i.e.,
not augmented through quantum indeterminism) is just the mutual information Eq. (3.144).
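As a minimal sketch of this bookkeeping (an illustration added here, not taken from the original text), the mutual information can be evaluated directly from the outcome statistics of a fixed measurement. The function names and the two sample distributions below are hypothetical choices; natural logarithms are assumed.

```python
import math

def shannon(p):
    """Shannon information H(p) in nats, with 0 ln 0 taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mutual_information(t, p0, p1):
    """J(t) = H(p) - (1 - t) H(p0) - t H(p1), where p = (1 - t) p0 + t p1."""
    p = [(1 - t) * x + t * y for x, y in zip(p0, p1)]
    return shannon(p) - (1 - t) * shannon(p0) - t * shannon(p1)

# Hypothetical outcome statistics of one two-outcome measurement
# when message 0 or message 1 was sent.
p0 = [0.9, 0.1]
p1 = [0.2, 0.8]
for t in (0.0, 0.25, 0.5, 1.0):
    print(t, mutual_information(t, p0, p1))
```

The residual term $(1-t)H(p_0) + tH(p_1)$ is what gets subtracted from the total information $H(p)$; at $t = 0$ or $t = 1$ nothing is left over.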



3.5.1 The Holevo Bound 

The problems associated with actually finding $I(t)$ and the measurement that gives rise to it are
every bit as difficult as those in maximizing the Kullback-Leibler information, perhaps more so;
for here it is not only the logarithm that confounds things, but also the fact that $\hat\rho_0$ and $\hat\rho_1$ are
"coupled" through the mean density operator $\hat\rho$. Outside of a very few isolated examples, namely
the case where $\hat\rho_0$ and $\hat\rho_1$ are pure states and the case where they are $2\times 2$ density operators
with equal determinant and equal prior probabilities [109, 110], explicit expressions for $I(t)$ have
never been calculated. Moreover no general algorithms for approximating this quantity appear to
exist as yet. There is a result, due to Davies [Q, 111], stating that there always exists an optimal
measurement $\{\hat E_b\}$ of the form

$$ \hat E_b = \alpha_b |\psi_b\rangle\langle\psi_b| , \qquad b = 1, \ldots, N , \tag{3.147} $$

where the number of terms is bracketed by

$$ D \le N \le D^2 \tag{3.148} $$



for signal states on a $D$-dimensional Hilbert space. This theorem, however, does not pin down the
measurement any further than that. The most useful general statements about $I(t)$ have been in
the form of bounds: an upper bound first conjectured in print by Gordon [112] in 1964 (though
there may also have been some discussion of it by Forney [113] in 1962) and proven by Holevo in
1973 (though Levitin did announce a similar but more restricted result [114] in 1969), and a lower
bound first conjectured by Wootters [115] and proven by Jozsa, Robb, and Wootters [^. These
bounds, however, are of little use in pinpointing the measurement that gives rise to $I(t)$. In what
follows, we simplify the derivation of the Holevo bound via a variation of the methods used in
Section 3.3. This simplification has the advantage of specifying a measurement, the use of which
immediately gives rise to a new lower bound to $I(t)$. We also supply an explicit expression for a
still-tighter upper bound whose existence is required within the original Holevo derivation.

Since Holevo's original derivation, various improved versions have also appeared [116, 117].
The improvements until now, however, have been in proving the upper bound for more general
situations: infinite dimensional Hilbert spaces, infinite numbers of message states, and infinite
numbers of measurement outcomes. In contrast, in this section and throughout the remainder of
the report, we retain the finiteness assumptions made by Holevo; the aim here is to build a deeper
understanding of the issues involved in finding approximations to the accessible information.
The Holevo upper bound to $I(t)$ is given by

$$ I(t) \le S(\hat\rho) - (1-t)\,S(\hat\rho_0) - t\,S(\hat\rho_1) \equiv S(t) , \tag{3.149} $$

where

$$ S(\hat\rho) = -\mathrm{tr}(\hat\rho \ln \hat\rho) = -\sum_{j=1}^{D} \lambda_j \ln \lambda_j \tag{3.150} $$

is the von Neumann entropy [115] of the density operator $\hat\rho$, whose eigenvalues are $\lambda_j$, $j = 1, \ldots, D$.
Equality is achieved in this bound if and only if $\hat\rho_0$ and $\hat\rho_1$ commute.
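The bound can be checked numerically for qubits. The sketch below is an added illustration, not part of the original derivation: it assumes the Bloch-vector parametrization of Section 3.5.4, restricts attention to projective measurements along a unit vector `m`, and uses hypothetical example vectors.

```python
import math

def entropy_from_bloch(r):
    """von Neumann entropy (nats) of a qubit whose Bloch vector has length r."""
    lams = [(1 + r) / 2, (1 - r) / 2]
    return -sum(l * math.log(l) for l in lams if l > 0)

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def holevo_bound(t, a, b):
    """S(t) = S(rho) - (1 - t) S(rho0) - t S(rho1) for Bloch vectors a and b."""
    c = [(1 - t) * x + t * y for x, y in zip(a, b)]
    return (entropy_from_bloch(norm(c))
            - (1 - t) * entropy_from_bloch(norm(a))
            - t * entropy_from_bloch(norm(b)))

def mutual_information(t, a, b, m):
    """J(t) for the projective measurement along the unit vector m."""
    def H(p):
        return -sum(x * math.log(x) for x in p if x > 0)
    def dist(v):
        s = sum(x * y for x, y in zip(v, m))
        return [(1 + s) / 2, (1 - s) / 2]
    c = [(1 - t) * x + t * y for x, y in zip(a, b)]
    return H(dist(c)) - (1 - t) * H(dist(a)) - t * H(dist(b))

# Noncommuting example: every measurement direction stays below S(t).
a, b = (0.6, 0.0, 0.0), (0.0, 0.8, 0.0)
best = max(mutual_information(0.5, a, b, (math.cos(0.01 * k), math.sin(0.01 * k), 0.0))
           for k in range(315))
print(best, holevo_bound(0.5, a, b))

# Commuting example: measuring in the common eigenbasis attains the bound.
ac, bc, z = (0.0, 0.0, 0.6), (0.0, 0.0, -0.3), (0.0, 0.0, 1.0)
print(mutual_information(0.5, ac, bc, z), holevo_bound(0.5, ac, bc))
```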

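The Jozsa-Robb-Wootters lower bound to $I(t)$ is formally quite similar to the Holevo upper
bound. For later reference, we write it out explicitly:

$$ I(t) \ge Q(\hat\rho) - (1-t)\,Q(\hat\rho_0) - t\,Q(\hat\rho_1) \equiv Q(t) , \tag{3.151} $$

where

$$ Q(\hat\rho) = -\sum_{j=1}^{D} \Biggl( \prod_{k \ne j} \frac{\lambda_j}{\lambda_j - \lambda_k} \Biggr) \lambda_j \ln \lambda_j \tag{3.152} $$

is the "sub-entropy" [119, 120] of the density operator $\hat\rho$. The formal similarity between Eqs. (3.149)
and (3.151) becomes even more apparent if $S(\hat\rho)$ and $Q(\hat\rho)$ are represented as contour integrals

$$ S(\hat\rho) = -\frac{1}{2\pi i} \oint_C (\ln z)\, \mathrm{tr}\Bigl( \bigl(\hat 1 - z^{-1}\hat\rho\bigr)^{-1} \Bigr)\, dz , \tag{3.153} $$

and

$$ Q(\hat\rho) = -\frac{1}{2\pi i} \oint_C (\ln z)\, \det\Bigl( \bigl(\hat 1 - z^{-1}\hat\rho\bigr)^{-1} \Bigr)\, dz , \tag{3.154} $$

where the contour $C$ encloses all the nonzero eigenvalues of $\hat\rho$.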
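For a concrete feel for the sub-entropy, the eigenvalue sum can be evaluated directly. This is a small sketch under the assumption that the eigenvalues are distinct (the product formula is singular otherwise); the spectra used are arbitrary examples.

```python
import math

def sub_entropy(lams):
    """Q = -sum_j [prod_{k != j} lam_j / (lam_j - lam_k)] lam_j ln lam_j.

    Assumes distinct eigenvalues; a zero eigenvalue contributes nothing."""
    total = 0.0
    for j, lj in enumerate(lams):
        if lj == 0:
            continue
        weight = 1.0
        for k, lk in enumerate(lams):
            if k != j:
                weight *= lj / (lj - lk)
        total -= weight * lj * math.log(lj)
    return total

def entropy(lams):
    """von Neumann entropy from the eigenvalues."""
    return -sum(l * math.log(l) for l in lams if l > 0)

for spectrum in ([0.7, 0.3], [0.5, 0.3, 0.2], [1.0, 0.0]):
    print(spectrum, sub_entropy(spectrum), entropy(spectrum))
```

In each example the sub-entropy lies between zero and the von Neumann entropy, and it vanishes for a pure state.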
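The key to deriving the Holevo bound is in realizing the importance of properties of $J(t)$ and
$S(t)$ as functions of $t$ [Q]. Note that

$$ J(0) = J(1) = S(0) = S(1) = 0 . \tag{3.155} $$

Moreover, both $J(t)$ and $S(t)$ are downwardly convex, as can be seen by working out their second
derivatives. For $J(t)$ a straightforward calculation gives

$$ J''(t) = -\sum_b \frac{\bigl(\mathrm{tr}(\hat\Delta \hat E_b)\bigr)^2}{\mathrm{tr}(\hat\rho \hat E_b)} , \tag{3.156} $$

where $\hat\Delta = \hat\rho_1 - \hat\rho_0$, so that $\hat\rho = \hat\rho_0 + t\hat\Delta$.

For $S(t)$ it is easiest to proceed by using the contour integral representation of $S(\hat\rho)$. By
differentiating within the integral and using the operator identity

$$ \frac{d\hat A^{-1}}{dt} = -\hat A^{-1}\, \frac{d\hat A}{dt}\, \hat A^{-1} \tag{3.157} $$

(which comes simply from the fact that $\hat A^{-1}\hat A = \hat 1$), one finds

$$ \frac{d}{dt} S(\hat\rho) = -\frac{1}{2\pi i} \oint_C (z \ln z)\, \mathrm{tr}\Bigl( (z\hat 1 - \hat\rho)^{-1} \hat\Delta (z\hat 1 - \hat\rho)^{-1} \Bigr)\, dz , \tag{3.158} $$

and

$$ \frac{d^2}{dt^2} S(\hat\rho) = -\frac{2}{2\pi i} \oint_C (z \ln z)\, \mathrm{tr}\Bigl( (z\hat 1 - \hat\rho)^{-1} \hat\Delta (z\hat 1 - \hat\rho)^{-1} \hat\Delta (z\hat 1 - \hat\rho)^{-1} \Bigr)\, dz . \tag{3.159} $$

Therefore, if $|j\rangle$ is the eigenvector of $\hat\rho$ with eigenvalue $\lambda_j$ and $\Delta_{jk} = \langle j|\hat\Delta|k\rangle$, we can write

$$ S''(t) = -\sum_{\{j,k \,|\, \lambda_j + \lambda_k \ne 0\}} \Phi(\lambda_j, \lambda_k)\, |\Delta_{jk}|^2 , \tag{3.160} $$

where

$$ \Phi(\lambda_j, \lambda_k) = \frac{2}{2\pi i} \oint_C \frac{z \ln z}{(z - \lambda_k)^2 (z - \lambda_j)}\, dz . \tag{3.161} $$

An application of Cauchy's integral theorem, together with the fact that $|\Delta_{jk}|^2$ is symmetric in
$j$ and $k$ so that $\Phi$ may be symmetrized under the sum, gives

$$ \Phi(x, y) = \frac{\ln x - \ln y}{x - y} \quad \mbox{if } x \ne y , \tag{3.162} $$

and

$$ \Phi(x, x) = \frac{1}{x} . \tag{3.163} $$

Expressions (3.156) and (3.160) are thus clearly nonpositive.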
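The curvature formula for $S(t)$ can be sanity-checked against a finite difference. The sketch below is an added illustration that assumes qubit signal states given by Bloch vectors `a` and `b` (hypothetical values), for which the eigenbasis of the mean state lies along the mean Bloch vector, the diagonal elements of the difference operator are plus and minus half the parallel component of `b - a`, and the squared off-diagonal element is a quarter of the squared perpendicular component.

```python
import math

def phi(x, y):
    """Phi(x, y): (ln x - ln y) / (x - y) for x != y, and 1 / x for x == y."""
    return 1.0 / x if x == y else (math.log(x) - math.log(y)) / (x - y)

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def entropy(r):
    """von Neumann entropy of a qubit with Bloch-vector length r."""
    return -sum(l * math.log(l) for l in ((1 + r) / 2, (1 - r) / 2) if l > 0)

def S_of_rho(t, a, b):
    c = [(1 - t) * x + t * y for x, y in zip(a, b)]
    return entropy(math.sqrt(dot(c, c)))

def S_second_derivative(t, a, b):
    """-sum_jk Phi(lam_j, lam_k) |Delta_jk|^2, evaluated in the eigenbasis of rho."""
    d = [y - x for x, y in zip(a, b)]
    c = [(1 - t) * x + t * y for x, y in zip(a, b)]
    r = math.sqrt(dot(c, c))
    lam_plus, lam_minus = (1 + r) / 2, (1 - r) / 2
    d_par = dot(d, c) / r                  # component of d along c
    d_perp2 = dot(d, d) - d_par ** 2       # squared component orthogonal to c
    diagonal = (phi(lam_plus, lam_plus) + phi(lam_minus, lam_minus)) * d_par ** 2 / 4
    off_diagonal = 2 * phi(lam_plus, lam_minus) * d_perp2 / 4
    return -(diagonal + off_diagonal)

a, b, t, h = (0.6, 0.0, 0.0), (0.0, 0.8, 0.0), 0.4, 1e-4
finite_difference = (S_of_rho(t + h, a, b) - 2 * S_of_rho(t, a, b)
                     + S_of_rho(t - h, a, b)) / h ** 2
print(S_second_derivative(t, a, b), finite_difference)
```

The terms of $S(t)$ that are linear in $t$ drop out of the second derivative, so only $S(\hat\rho)$ needs to be differenced.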

The statement that $S(t)$ is an upper bound to $J(t)$ for any $t$ is equivalent to the property
that, when plotted versus $t$, the curve for $S(t)$ has a more negative curvature than the curve for
$J(t)$ regardless of which POVM $\{\hat E_b\}$ is used in its definition. This has to be because $S(t)$ must
climb higher than $J(t)$ to be an upper bound for it. The meat of the derivation is in showing the
inequality

$$ S''(t) \le J''(t) \le 0 \quad \mbox{for any POVM } \{\hat E_b\} . \tag{3.164} $$

Holevo does this by demonstrating the existence of a function $L''(t)$, independent of $\{\hat E_b\}$, such
that

$$ S''(t) \le L''(t) \quad \mbox{and} \quad L''(t) \le J''(t) . \tag{3.165} $$

From this it follows, upon enforcing the boundary condition

$$ L(0) = L(1) = 0 , \tag{3.166} $$

that

$$ I(t) \le L(t) \le S(t) . \tag{3.167} $$

(It should be noted that $L(t)$ is not explicitly computed in the original reference; it is only shown
to exist via the expression for its second derivative.)

At this point a fairly drastic simplification can be made to the original proof. An easy way to get
at such a function $L''(t)$ is simply to minimize $J''(t)$ over all POVMs $\{\hat E_b\}$ and to define the result
to be the function $L''(t)$. Thereafter one can work to show that $S''(t) \le L''(t)$. This is decidedly
more tractable than extremizing the mutual information $J(t)$ itself because no logarithms appear
in $J''(t)$; there can be hope for a solution by means of standard algebraic inequalities such as the
Schwarz inequality. This approach, it turns out, generates exactly the same function $L''(t)$ as used
by Holevo in the original proof, though the two derivations appear to have little to do with each
other. The difference of real importance here is that this approach pinpoints the measurement that
actually minimizes $J''(t)$. This measurement, though it generally does not maximize $J(t)$ itself,
necessarily does provide a lower bound to the accessible information $I(t)$.

The problem of minimizing Eq. (3.156) is formally identical to the problem considered by
Braunstein and Caves [122]: the expression for $-J''(t)$ is of the same form as the Fisher information
optimized there. The steps are as follows. The idea is to think of the numerator within the
sum (3.156) as analogous to the left hand side of the Schwarz inequality:

$$ |\mathrm{tr}(\hat A^\dagger \hat B)|^2 \le \mathrm{tr}(\hat A^\dagger \hat A)\, \mathrm{tr}(\hat B^\dagger \hat B) . \tag{3.168} $$

One would like to use this inequality in such a way that the $\mathrm{tr}(\hat\rho \hat E_b)$ term in the denominator
is cancelled and only an expression linear in $\hat E_b$ is left; for then, upon summing over the index
$b$, the completeness property for POVMs will leave the final expression independent of the given
measurement.

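These ideas are formalized by introducing a "lowering" super-operator $\mathcal G_{\hat C}$ (i.e., a mapping from
operators to operators that depends explicitly on a third operator $\hat C$) with the property that for
any operators $\hat A$ and $\hat B$ and any positive operator $\hat C$,

$$ \mathrm{tr}(\hat A \hat B) \le \Bigl| \mathrm{tr}\bigl( \hat C \hat B\, \mathcal G_{\hat C}(\hat A) \bigr) \Bigr| . \tag{3.169} $$

There are many examples of such super-operators; perhaps the simplest example is

$$ \mathcal G_{\hat C}(\hat A) = \hat A \hat C^{-1} , \tag{3.170} $$

when $\hat C$ is invertible. In any case, for these super-operators, one can derive, via simple applications
of the Schwarz inequality, that

$$ \bigl( \mathrm{tr}(\hat\Delta \hat E_b) \bigr)^2 \le \Bigl| \mathrm{tr}\bigl( \hat\rho \hat E_b\, \mathcal G_{\hat\rho}(\hat\Delta) \bigr) \Bigr|^2 = \Bigl| \mathrm{tr}\Bigl( \bigl( \hat E_b^{1/2} \hat\rho^{1/2} \bigr)^\dagger \bigl( \hat E_b^{1/2}\, \mathcal G_{\hat\rho}(\hat\Delta)\, \hat\rho^{1/2} \bigr) \Bigr) \Bigr|^2 \le \mathrm{tr}(\hat\rho \hat E_b)\, \mathrm{tr}\Bigl( \hat E_b\, \mathcal G_{\hat\rho}(\hat\Delta)\, \hat\rho\, \mathcal G_{\hat\rho}(\hat\Delta)^\dagger \Bigr) , \tag{3.171} $$

and

$$ \bigl( \mathrm{tr}(\hat\Delta \hat E_b) \bigr)^2 \le \Bigl| \mathrm{tr}\bigl( \hat\rho^{1/2} \hat E_b\, \mathcal G_{\hat\rho^{1/2}}(\hat\Delta) \bigr) \Bigr|^2 = \Bigl| \mathrm{tr}\Bigl( \bigl( \hat E_b^{1/2} \hat\rho^{1/2} \bigr)^\dagger \bigl( \hat E_b^{1/2}\, \mathcal G_{\hat\rho^{1/2}}(\hat\Delta) \bigr) \Bigr) \Bigr|^2 \le \mathrm{tr}(\hat\rho \hat E_b)\, \mathrm{tr}\Bigl( \hat E_b\, \mathcal G_{\hat\rho^{1/2}}(\hat\Delta)\, \mathcal G_{\hat\rho^{1/2}}(\hat\Delta)^\dagger \Bigr) . \tag{3.172} $$

By $\mathcal G_{\hat C}(\hat A)^\dagger$ we mean simply the Hermitian conjugate to the operator $\mathcal G_{\hat C}(\hat A)$. The conditions for
equality in these are that the super-operators $\mathcal G_{\hat\rho}$ and $\mathcal G_{\hat\rho^{1/2}}$ give equality in the first steps and,
moreover, saturate the Schwarz inequality via

$$ \hat E_b^{1/2}\, \mathcal G_{\hat\rho}(\hat\Delta)\, \hat\rho^{1/2} = \mu_b\, \hat E_b^{1/2} \hat\rho^{1/2} , \tag{3.173} $$

and

$$ \hat E_b^{1/2}\, \mathcal G_{\hat\rho^{1/2}}(\hat\Delta) = \mu_b\, \hat E_b^{1/2} \hat\rho^{1/2} , \tag{3.174} $$

respectively.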

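Using inequalities (3.171) and (3.172) in Eq. (3.156) for $J''(t)$ immediately gives the lower
bounds

$$ J''(t) \ge -\mathrm{tr}\Bigl( \mathcal G_{\hat\rho}(\hat\Delta)\, \hat\rho\, \mathcal G_{\hat\rho}(\hat\Delta)^\dagger \Bigr) \tag{3.175} $$

and

$$ J''(t) \ge -\mathrm{tr}\Bigl( \mathcal G_{\hat\rho^{1/2}}(\hat\Delta)\, \mathcal G_{\hat\rho^{1/2}}(\hat\Delta)^\dagger \Bigr) . \tag{3.176} $$

The problem now, much like in Section 3.3, is to choose a super-operator $\mathcal G_{\hat\rho}$ or $\mathcal G_{\hat\rho^{1/2}}$ in such a way that
equality can be attained in either Eq. (3.175) or Eq. (3.176).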

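The super-operator $\mathcal L_{\hat\rho}$ that does the trick for minimizing Eq. (3.156) [122] is defined by its
action on an operator $\hat A$ through

$$ \tfrac{1}{2} \bigl( \hat\rho\, \mathcal L_{\hat\rho}(\hat A) + \mathcal L_{\hat\rho}(\hat A)\, \hat\rho \bigr) = \hat A . \tag{3.177} $$

This equation is a special case of the operator equation known as the Lyapunov equation

$$ \hat B \hat X + \hat X \hat C = \hat D . \tag{3.178} $$

The Lyapunov equation has a solution for all $\hat D$ if and only if no eigenvalue of $\hat B$ and no eigenvalue
of $\hat C$ sum to zero. Thus when $\hat\rho$ has zero eigenvalues, $\mathcal L_{\hat\rho}(\hat A)$ is not well defined for a general operator
$\hat A$. For the case of interest here, however, where $\hat A = \hat\Delta$, $\mathcal L_{\hat\rho}(\hat\Delta)$ does exist regardless of whether $\hat\rho$
has zero eigenvalues or not. This can be seen by constructing a solution.

Let $|j\rangle$ be an orthonormal basis that diagonalizes $\hat\rho$ and let $\lambda_j$ be the associated eigenvalues.
Note that if $\lambda_j = 0$, then

$$ 0 = \langle j|\hat\rho|j\rangle = (1-t)\langle j|\hat\rho_0|j\rangle + t\langle j|\hat\rho_1|j\rangle . \tag{3.179} $$

Therefore, if $0 < t < 1$, we must have that both $\langle j|\hat\rho_0|j\rangle = 0$ and $\langle j|\hat\rho_1|j\rangle = 0$. So if $\lambda_j = 0$, then
$\hat\rho_0^{1/2}|j\rangle = 0$ and $\hat\rho_1^{1/2}|j\rangle = 0$. In particular, sandwiching Eq. (3.177) between eigenvectors of $\hat\rho$, we
find that

$$ \tfrac{1}{2} (\lambda_j + \lambda_k)\, \mathcal L_{\hat\rho}(\hat\Delta)_{jk} = \Delta_{jk} \tag{3.180} $$

has a solution for $\mathcal L_{\hat\rho}(\hat\Delta)_{jk} = \langle j|\mathcal L_{\hat\rho}(\hat\Delta)|k\rangle$ because $\Delta_{jk} = \langle j|\hat\Delta|k\rangle$ vanishes whenever $\lambda_j + \lambda_k = 0$.
With this $\mathcal L_{\hat\rho}(\hat\Delta)$ becomes

$$ \mathcal L_{\hat\rho}(\hat\Delta) = \sum_{\{j,k \,|\, \lambda_j + \lambda_k \ne 0\}} \frac{2}{\lambda_j + \lambda_k}\, \Delta_{jk}\, |j\rangle\langle k| , \tag{3.181} $$

where we have conveniently set the terms in the null subspace of $\hat\rho$ to be zero. (For further discussion
of why Eq. (3.181) is the appropriate extension of $\mathcal L_{\hat\rho}(\hat\Delta)$ to the zero-eigenvalue subspaces of $\hat\rho$, see
Ref. [122]; note that $\mathcal L_{\hat\rho}$ is denoted there by $\mathcal R_{\hat\rho}^{-1}$.) Eq. (3.181) demonstrates that $\mathcal L_{\hat\rho}(\hat\Delta)$ can be
taken to be a Hermitian operator.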
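The construction above amounts to an entrywise division in the eigenbasis of the mean density operator. The following sketch is an added illustration that works directly in that eigenbasis, so the density operator is represented only by its eigenvalues; the real symmetric matrix `Delta` and the helper names are hypothetical.

```python
def lowering(lams, Delta):
    """Lowering super-operator applied to Delta, written in the eigenbasis of rho.

    Entries are 2 Delta_jk / (lam_j + lam_k), set to zero where lam_j + lam_k = 0."""
    D = len(lams)
    return [[2 * Delta[j][k] / (lams[j] + lams[k]) if lams[j] + lams[k] != 0 else 0.0
             for k in range(D)] for j in range(D)]

def half_anticommutator(lams, L):
    """(rho L + L rho) / 2 for the diagonal matrix rho = diag(lams)."""
    D = len(lams)
    return [[0.5 * (lams[j] + lams[k]) * L[j][k] for k in range(D)] for j in range(D)]

lams = [0.5, 0.3, 0.2]
Delta = [[0.10, 0.05, -0.02],
         [0.05, -0.04, 0.03],
         [-0.02, 0.03, -0.06]]      # real symmetric and traceless

L = lowering(lams, Delta)
recovered = half_anticommutator(lams, L)
# -tr(Delta L), the quantity that bounds J''(t) from below
L_second = -sum(Delta[j][k] * L[k][j] for j in range(3) for k in range(3))
print(recovered)
print(L_second)
```

Feeding the result back through the anticommutator recovers `Delta`, and the trace $-\mathrm{tr}(\hat\Delta\,\mathcal L_{\hat\rho}(\hat\Delta))$ comes out negative, as it must for a curvature bound.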

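The super-operator $\mathcal L_{\hat\rho}$ is easily seen to satisfy the defining property of a "lowering" super-operator,
Eq. (3.169), because, for Hermitian $\hat A$ and $\hat B$,

$$ \begin{aligned}
\Bigl| \mathrm{tr}\bigl( \hat\rho \hat A\, \mathcal L_{\hat\rho}(\hat B) \bigr) \Bigr| &\ge \mathrm{Re}\Bigl[ \mathrm{tr}\bigl( \hat\rho \hat A\, \mathcal L_{\hat\rho}(\hat B) \bigr) \Bigr] \\
&= \tfrac{1}{2} \Bigl[ \mathrm{tr}\bigl( \hat\rho \hat A\, \mathcal L_{\hat\rho}(\hat B) \bigr) + \mathrm{tr}\bigl( \hat\rho \hat A\, \mathcal L_{\hat\rho}(\hat B) \bigr)^{*} \Bigr] \\
&= \tfrac{1}{2} \Bigl[ \mathrm{tr}\bigl( \hat\rho \hat A\, \mathcal L_{\hat\rho}(\hat B) \bigr) + \mathrm{tr}\bigl( \mathcal L_{\hat\rho}(\hat B)\, \hat A \hat\rho \bigr) \Bigr] \\
&= \tfrac{1}{2}\, \mathrm{tr}\Bigl( \hat A \bigl( \mathcal L_{\hat\rho}(\hat B)\, \hat\rho + \hat\rho\, \mathcal L_{\hat\rho}(\hat B) \bigr) \Bigr) \\
&= \mathrm{tr}(\hat A \hat B) .
\end{aligned} \tag{3.182} $$

The desired optimization is via Eq. (3.175):

$$ J''(t) \ge -\mathrm{tr}\bigl( \mathcal L_{\hat\rho}(\hat\Delta)\, \hat\rho\, \mathcal L_{\hat\rho}(\hat\Delta) \bigr) = -\mathrm{tr}\bigl( \hat\Delta\, \mathcal L_{\hat\rho}(\hat\Delta) \bigr) = -\sum_{\{j,k \,|\, \lambda_j + \lambda_k \ne 0\}} \frac{2}{\lambda_j + \lambda_k}\, |\Delta_{jk}|^2 . \tag{3.183} $$

Equality can be satisfied in Eq. (3.183) if

$$ \mathrm{Im}\Bigl[ \mathrm{tr}\bigl( \hat\rho \hat E_b\, \mathcal L_{\hat\rho}(\hat\Delta) \bigr) \Bigr] = 0 \quad \mbox{for all } b , \tag{3.184} $$

and, from Eq. (3.173),

$$ \mathcal L_{\hat\rho}(\hat\Delta)\, \hat E_b^{1/2} = \mu_b\, \hat E_b^{1/2} \quad \mbox{for all } b . \tag{3.185} $$

The second of these can be met easily by choosing the operators $\hat E_b = \hat E_b^F = |b\rangle\langle b|$ to be projectors
onto an eigenbasis for the Hermitian operator $\mathcal L_{\hat\rho}(\hat\Delta)$ and choosing the constants $\mu_b$ to be the
eigenvalues of $\mathcal L_{\hat\rho}(\hat\Delta)$. The first condition then follows simply by Eq. (3.185) being satisfied. For

$$ \mathrm{tr}\bigl( \hat\rho \hat E_b\, \mathcal L_{\hat\rho}(\hat\Delta) \bigr) = \mathrm{tr}\bigl( \hat\rho\, \hat E_b^{1/2} \hat E_b^{1/2}\, \mathcal L_{\hat\rho}(\hat\Delta) \bigr) = \mu_b\, \mathrm{tr}(\hat\rho \hat E_b) , \tag{3.186} $$

which is clearly a real number.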
The function $L''(t)$ can now be defined as

$$ L''(t) = -\mathrm{tr}\bigl( \hat\Delta\, \mathcal L_{\hat\rho}(\hat\Delta) \bigr) . \tag{3.187} $$

This, as stated above, is exactly the function $L''(t)$ used by Holevo, but obtained there by other
means. Again, among other things, what is new here is that the derivation above gives a way of
associating a measurement with the function $L''(t)$. This measurement may be used to define a
new lower bound to the accessible information $I(t)$; this bound we shall call $M(t)$.

The next step in the derivation of Eq. (3.149) is to show that $S''(t) \le L''(t)$; i.e., that

$$ \sum_{\{j,k \,|\, \lambda_j + \lambda_k \ne 0\}} \Phi(\lambda_j, \lambda_k)\, |\Delta_{jk}|^2 \ge \sum_{\{j,k \,|\, \lambda_j + \lambda_k \ne 0\}} \frac{2}{\lambda_j + \lambda_k}\, |\Delta_{jk}|^2 . \tag{3.188} $$

This can be accomplished by demonstrating the arithmetic inequality [5, 123],

$$ \Phi(x, y) \ge \frac{2}{x + y} . \tag{3.189} $$

That is, in words, that the arithmetic mean of $x$ and $y$ is greater than or equal to their logarithmic
mean [124]. We reiterate Holevo's method of proof in particular. The case for $\Phi(x, x)$ is easy:
equality is satisfied automatically. Suppose now that $x \ne y$ for $0 < x, y \le 1$ and let

$$ s = \frac{x - y}{x + y} . \tag{3.190} $$

Then

$$ x = \tfrac{1}{2}(x + y)(1 + s) \quad \mbox{and} \quad y = \tfrac{1}{2}(x + y)(1 - s) , \tag{3.191} $$

and $0 < |s| < 1$. Hence

$$ \ln x - \ln y = \ln(1 + s) - \ln(1 - s) \tag{3.192} $$

has a convergent Taylor series expansion about $s = 0$. In particular, one has

$$ \tfrac{1}{2}(x + y)\, \Phi(x, y) = 1 + \sum_{n=1}^{\infty} \frac{s^{2n}}{2n + 1} . \tag{3.193} $$

Since this expansion contains only even powers of $s$, it follows immediately that Eq. (3.189) holds,
with equality if and only if $x = y$. This completes the demonstration that the Holevo upper bound
is indeed a bound for the accessible information of a binary quantum communication channel.
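The series argument is easy to confirm numerically. The snippet below is an added check, with arbitrarily chosen sample points, that half the sum of $x$ and $y$ times $\Phi(x, y)$ agrees with the truncated even-power series in $s$ and never drops below one.

```python
import math

def phi(x, y):
    """Phi(x, y): (ln x - ln y) / (x - y) for x != y, and 1 / x for x == y."""
    return 1.0 / x if x == y else (math.log(x) - math.log(y)) / (x - y)

def even_series(x, y, terms=200):
    """1 + sum_{n >= 1} s^(2n) / (2n + 1) with s = (x - y) / (x + y), truncated."""
    s = (x - y) / (x + y)
    return 1.0 + sum(s ** (2 * n) / (2 * n + 1) for n in range(1, terms + 1))

samples = [(0.9, 0.1), (0.5, 0.5), (0.3, 0.31), (1.0, 0.2)]
for x, y in samples:
    lhs = 0.5 * (x + y) * phi(x, y)
    print(x, y, lhs, even_series(x, y))
```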

The only piece remaining to be shown is that the bound is achieved if and only if $\hat\rho_0$ and $\hat\rho_1$
commute. The "if" side of the statement is trivial; one need only choose the $\hat E_b$ to be projectors
onto a common basis diagonalizing both $\hat\rho_0$ and $\hat\rho_1$. Then one has immediately that

$$ J(t) = I(t) = S(\hat\rho) - (1-t)\,S(\hat\rho_0) - t\,S(\hat\rho_1) . \tag{3.194} $$

For the noncommuting case, if we can show that $L(t)$ is strictly less than $S(t)$, then our work will
be done.

We do this by showing that $L(t) = S(t)$ implies that $\hat\rho_0$ and $\hat\rho_1$ commute. Taking two derivatives
of the supposition, we must have that $S''(t) = L''(t)$. Note, from Eq. (3.188) and the properties of
$\Phi(x, y)$, that this holds if and only if

$$ |\Delta_{jk}|^2 = 0 \quad \mbox{for all } j \ne k . \tag{3.195} $$

The latter condition implies that $\hat\rho_0$ and $\hat\rho_1$ commute. This is seen easily, for note that Eq. (3.195)
implies

$$ 0 = \sum_{jk} (\lambda_j - \lambda_k)^2\, \mathrm{tr}\bigl( \hat\Pi_k \hat\Delta \hat\Pi_j \hat\Delta \bigr) = \sum_{jk} \mathrm{tr}\Bigl( \hat\Pi_k (\hat\rho\hat\Delta - \hat\Delta\hat\rho)\, \hat\Pi_j (\hat\Delta\hat\rho - \hat\rho\hat\Delta) \Bigr) = \mathrm{tr}\Bigl( (\hat\Delta\hat\rho - \hat\rho\hat\Delta)^\dagger (\hat\Delta\hat\rho - \hat\rho\hat\Delta) \Bigr) , \tag{3.196} $$

where $\hat\Pi_j = |j\rangle\langle j|$. This implies $\hat\Delta\hat\rho - \hat\rho\hat\Delta = 0$ and thus $[\hat\rho_0, \hat\rho_1] = 0$. This completes the proof.

3.5.2 The Lower Bound M(t)

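Now we focus on deriving an explicit expression for the lower bound $M(t)$ to the accessible
information. This bound takes on a surprisingly simple form. Moreover, as we shall see for the
two-dimensional case, this lower bound can be quite close to the accessible information and sometimes
actually equal to it.

For simplicity, let us suppose $\hat\rho_0$ and $\hat\rho_1$ are invertible. We start by rewriting, in the manner of
Eq. (3.93), the mutual information as

$$ J(t) = \mathrm{tr}\Bigl( (1-t)\, \hat\rho_0 \sum_b (\ln \alpha_b)\, \hat E_b + t\, \hat\rho_1 \sum_b (\ln \beta_b)\, \hat E_b \Bigr) , \tag{3.197} $$

where

$$ \alpha_b = \frac{\mathrm{tr}(\hat\rho_0 \hat E_b)}{\mathrm{tr}(\hat\rho \hat E_b)} \quad \mbox{and} \quad \beta_b = \frac{\mathrm{tr}(\hat\rho_1 \hat E_b)}{\mathrm{tr}(\hat\rho \hat E_b)} . \tag{3.198} $$

The lower bound $M(t)$ is defined by inserting the projectors onto a basis that diagonalizes
$\mathcal L_{\hat\rho}(\hat\Delta)$ into this formula. This expression simplifies because of a curious fact: even though $\hat\rho_0$ and
$\hat\rho_1$ need not commute, $\mathcal L_{\hat\rho}(\hat\rho_0)$ and $\mathcal L_{\hat\rho}(\hat\rho_1)$ necessarily do commute. This follows from the linearity
of the $\mathcal L_{\hat\rho}$ super-operator:

$$ \mathcal L_{\hat\rho}(\hat\rho_0) = \mathcal L_{\hat\rho}(\hat\rho - t\hat\Delta) = \hat 1 - t\, \mathcal L_{\hat\rho}(\hat\Delta) , \tag{3.199} $$

and

$$ \mathcal L_{\hat\rho}(\hat\rho_1) = \mathcal L_{\hat\rho}\bigl( \hat\rho + (1-t)\hat\Delta \bigr) = \hat 1 + (1-t)\, \mathcal L_{\hat\rho}(\hat\Delta) . \tag{3.200} $$

Thus the projectors that diagonalize $\mathcal L_{\hat\rho}(\hat\Delta)$ also clearly diagonalize both $\mathcal L_{\hat\rho}(\hat\rho_0)$ and $\mathcal L_{\hat\rho}(\hat\rho_1)$.

Therefore, if $|b\rangle$ is such that $\hat E_b^F = |b\rangle\langle b|$ and $\alpha_{Fb}$ is the associated eigenvalue of $\mathcal L_{\hat\rho}(\hat\rho_0)$, then
the definition of $\mathcal L_{\hat\rho}(\hat\rho_0)$, Eq. (3.177), requires that

$$ \alpha_{Fb} = \frac{\langle b|\hat\rho_0|b\rangle}{\langle b|\hat\rho|b\rangle} = \frac{\mathrm{tr}(\hat\rho_0 \hat E_b^F)}{\mathrm{tr}(\hat\rho \hat E_b^F)} . \tag{3.201} $$

Thus the operator $\mathcal L_{\hat\rho}(\hat\rho_0)$, much like the operator $\hat M = \hat\rho_1^{-1/2} \sqrt{\hat\rho_1^{1/2} \hat\rho_0 \hat\rho_1^{1/2}}\, \hat\rho_1^{-1/2}$, can be considered
an operator analog to the classical likelihood ratio. Similarly, for the eigenvalues $\beta_{Fb}$ of $\mathcal L_{\hat\rho}(\hat\rho_1)$, we
must have

$$ \beta_{Fb} = \frac{\mathrm{tr}(\hat\rho_1 \hat E_b^F)}{\mathrm{tr}(\hat\rho \hat E_b^F)} . \tag{3.202} $$

Hence $M(t)$ takes the simple form

$$ M(t) = \mathrm{tr}\Bigl( (1-t)\, \hat\rho_0 \ln\bigl( \mathcal L_{\hat\rho}(\hat\rho_0) \bigr) + t\, \hat\rho_1 \ln\bigl( \mathcal L_{\hat\rho}(\hat\rho_1) \bigr) \Bigr) . \tag{3.203} $$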

As an aside, we note that one can also obtain by this method a lower bound to the quantum
Kullback information (in the vein of Eq. (3.93)). Using the measurement basis that diagonalizes
$\mathcal L_{\hat\rho_1}(\hat\rho_0)$ in the classical Kullback-Leibler relative information, we have, via the steps above, the
expression

$$ K_F(\hat\rho_0/\hat\rho_1) = \mathrm{tr}\Bigl( \hat\rho_0 \ln\bigl( \mathcal L_{\hat\rho_1}(\hat\rho_0) \bigr) \Bigr) , \tag{3.204} $$

whenever $\mathcal L_{\hat\rho_1}(\hat\rho_0)$ is well defined.

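There is a close relation between the lowering super-operator discussed here and the optimal
measurement operator $\hat M$ for statistical overlap found in Section 3.3. This can be seen by noting
that for small $\epsilon$

$$ \sqrt{\hat A + \epsilon \hat B} \approx \sqrt{\hat A} + \tfrac{1}{2}\, \epsilon\, \mathcal L_{\sqrt{\hat A}}(\hat B) , \tag{3.205} $$

when $\hat A$ is invertible (as can be seen by squaring the left and right hand sides), and for any operator
$\hat B$ that commutes with $\hat\rho$,

$$ \mathcal L_{\hat\rho}(\hat B \hat A \hat B) = \hat B\, \mathcal L_{\hat\rho}(\hat A)\, \hat B . \tag{3.206} $$

Then, when

$$ \hat\rho_0 = \hat\rho_1 + \delta\hat\rho , \tag{3.207} $$

so that the two density operators to be distinguished are close to each other,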



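$$ \hat M = \hat\rho_1^{-1/2} \sqrt{ \hat\rho_1^{1/2} (\hat\rho_1 + \delta\hat\rho)\, \hat\rho_1^{1/2} }\; \hat\rho_1^{-1/2} \approx \hat 1 + \tfrac{1}{2}\, \mathcal L_{\hat\rho_1}(\delta\hat\rho) \approx \tfrac{3}{2}\, \hat 1 - \tfrac{1}{2}\, \mathcal L_{\hat\rho_0}(\hat\rho_1) . \tag{3.208} $$

Thus the measurement bases defined by $\hat M$ and $\mathcal L_{\hat\rho_0}(\hat\rho_1)$ in this limit can be taken to be the same.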

Equations (3.204) and (3.93), it should be noted, are both distinct from that quantity usually
considered the quantum analog to the Kullback-Leibler information in the literature [125, 126].
That quantity, given by

$$ K_U(\hat\rho_0/\hat\rho_1) = \mathrm{tr}\bigl( \hat\rho_0 \ln \hat\rho_0 - \hat\rho_0 \ln \hat\rho_1 \bigr) , \tag{3.209} $$

is not a lower bound to the maximum Kullback-Leibler information. In fact, the Holevo upper
bound to the mutual information is easily seen to be expressible in terms of it:

$$ I(t) \le (1-t)\, K_U(\hat\rho_0/\hat\rho) + t\, K_U(\hat\rho_1/\hat\rho) . \tag{3.210} $$
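That the right-hand side of this inequality coincides with the Holevo quantity $S(t)$ can be verified numerically. The sketch below is an added illustration restricted to qubits in the Bloch-vector parametrization of Section 3.5.4, where the logarithm of the mean state is diagonal along the mean Bloch vector; the example vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def entropy(r):
    """von Neumann entropy of a qubit with Bloch-vector length r."""
    return -sum(l * math.log(l) for l in ((1 + r) / 2, (1 - r) / 2) if l > 0)

def umegaki(s, c):
    """tr(sigma ln sigma - sigma ln rho) for qubits with Bloch vectors s and c.

    Requires 0 < |c| < 1 so that ln rho is finite and its eigenbasis is defined."""
    r = math.sqrt(dot(c, c))
    overlap = dot(s, c) / r
    tr_sigma_ln_rho = (0.5 * (1 + overlap) * math.log((1 + r) / 2)
                       + 0.5 * (1 - overlap) * math.log((1 - r) / 2))
    return -entropy(math.sqrt(dot(s, s))) - tr_sigma_ln_rho

def holevo(t, a, b):
    c = [(1 - t) * x + t * y for x, y in zip(a, b)]
    return (entropy(math.sqrt(dot(c, c)))
            - (1 - t) * entropy(math.sqrt(dot(a, a)))
            - t * entropy(math.sqrt(dot(b, b))))

a, b, t = (0.6, 0.0, 0.0), (0.0, 0.8, 0.0), 0.3
c = [(1 - t) * x + t * y for x, y in zip(a, b)]
relative_form = (1 - t) * umegaki(a, c) + t * umegaki(b, c)
print(holevo(t, a, b), relative_form)
```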
3.5.3 The Upper Bound L(t)

The upper bound $L(t)$ has not yet yielded a form as sleek as the one found for $M(t)$. All that need be
done in principle, of course, is integrate Eq. (3.187) twice and apply the boundary conditions $L(0) =
L(1) = 0$. The trouble lies in that, whereas general methods exist for differentiating operators with
respect to parameters [127, 128, 129, 130, 131, 132], methods for the inverse problem of integration
are nowhere to be found. The main problem here reduces to finding a tractable representation
for $\mathcal L_{\hat\rho}(\hat\Delta)$. This has turned out to be a difficult matter in spite of the fact that Eq. (3.177) is a
special case of the Lyapunov equation and has been studied widely in the mathematical literature.
Though there are convenient ways for finding numerical expressions for $\mathcal L_{\hat\rho}(\hat\Delta)$ [133] (even when
the matrices involved are 1000-dimensional [134]), this is of no help to our integration problem.
On the other hand, there do exist various methods for obtaining an exact expression for $\mathcal L_{\hat\rho}(\hat\Delta)$ in
a basis that does not depend on $t$ [135, 136, 137, 138, 139]. These expressions can be integrated
in principle. However the representations so found appear to have no compact form that makes
obvious the algorithm for their calculation.

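Two of the more useful representations for $\mathcal L_{\hat\rho}(\hat\Delta)$ [140, 141] might appear to be a contour
integral representation (when $\hat\rho$ is invertible),

$$ \mathcal L_{\hat\rho}(\hat\Delta) = \frac{2}{2\pi i} \oint_C (z\hat 1 - \hat\rho)^{-1}\, \hat\Delta\, (z\hat 1 + \hat\rho)^{-1}\, dz , \tag{3.211} $$

where the contour contains the pole at $z = \lambda_j$ for all eigenvalues $\lambda_j$ of $\hat\rho$, but does not contain the
pole at $z = -\lambda_j$ for any $j$, and, more generally, a Riemann integral representation,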



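$$ \mathcal L_{\hat\rho}(\hat\Delta) = 2 \int_0^\infty e^{-\hat\rho u}\, \hat\Delta\, e^{-\hat\rho u}\, du . \tag{3.212} $$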



Both of these expressions can be checked easily enough by writing them out in a basis that diago- 
nalizes p. 

However, these two representations really lead nowhere. Seemingly the best one can do with
them is use them to derive a doubly infinite (to be explained momentarily) Fourier sine series
expansion for $L(t)$:

$$ L(t) = \sum_{m=1}^{\infty} b_m \sin(m\pi t) . \tag{3.213} $$

This unfortunately does not have the compelling conciseness of Eq. (3.203), but at least it does
automatically satisfy the boundary conditions. Perhaps one can hope that only the first few terms
in Eq. (3.213) are significant.

Luckily it turns out that there are better, somewhat nonstandard representations of $\mathcal L_{\hat\rho}$ to work
with. Nevertheless, before working toward a better representation, we go through the exercise of
building the Fourier expansion to illustrate the difficulties encountered. The idea is to start off with
a Taylor expansion for $L''(t)$ about $t = 0$ and then use that in finding the coefficients $b_m$, these
being given by the standard Fourier algorithm,

$$ b_m = -\frac{2}{(m\pi)^2} \int_0^1 L''(t) \sin(m\pi t)\, dt . \tag{3.214} $$

The Taylor series expansion for $L''(t)$ is

$$ L''(t) = -\sum_{n=0}^{\infty} \frac{t^n}{n!}\, \mathrm{tr}\Bigl( \hat\Delta\, \frac{d^n}{dt^n} \mathcal L_{\hat\rho}(\hat\Delta) \Big|_{t=0} \Bigr) . \tag{3.215} $$

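Now using the operator-inverse differentiation formula Eq. (3.157) within the integral of Eq. (3.211)
one finds the following. Differentiating once with respect to $t$ generates one term of the form

$$ (z\hat 1 - \hat\rho)^{-1} \hat\Delta (z\hat 1 - \hat\rho)^{-1} \hat\Delta (z\hat 1 + \hat\rho)^{-1} = \hat\Delta^{-1} \bigl[ \hat\Delta (z\hat 1 - \hat\rho)^{-1} \bigr]^2 \bigl[ \hat\Delta (z\hat 1 + \hat\rho)^{-1} \bigr] \tag{3.216} $$

and one term of the form

$$ -(z\hat 1 - \hat\rho)^{-1} \hat\Delta (z\hat 1 + \hat\rho)^{-1} \hat\Delta (z\hat 1 + \hat\rho)^{-1} = -\hat\Delta^{-1} \bigl[ \hat\Delta (z\hat 1 - \hat\rho)^{-1} \bigr] \bigl[ \hat\Delta (z\hat 1 + \hat\rho)^{-1} \bigr]^2 . \tag{3.217} $$

Differentiating again generates two terms of the form

$$ \hat\Delta^{-1} \bigl[ \hat\Delta (z\hat 1 - \hat\rho)^{-1} \bigr]^3 \bigl[ \hat\Delta (z\hat 1 + \hat\rho)^{-1} \bigr] , \tag{3.218} $$

two of the form

$$ -\hat\Delta^{-1} \bigl[ \hat\Delta (z\hat 1 - \hat\rho)^{-1} \bigr]^2 \bigl[ \hat\Delta (z\hat 1 + \hat\rho)^{-1} \bigr]^2 , \tag{3.219} $$

and two of the form

$$ \hat\Delta^{-1} \bigl[ \hat\Delta (z\hat 1 - \hat\rho)^{-1} \bigr] \bigl[ \hat\Delta (z\hat 1 + \hat\rho)^{-1} \bigr]^3 . \tag{3.220} $$

The pattern quickly becomes apparent; in general, one has:

$$ \hat\Delta\, \frac{d^n}{dt^n} \mathcal L_{\hat\rho}(\hat\Delta) = 2 (n!) \sum_{k=0}^{n} (-1)^k D_{\hat\rho}(n; k) , \tag{3.221} $$

where

$$ D_{\hat\rho}(n; k) = \frac{1}{2\pi i} \oint_C \bigl[ \hat\Delta (z\hat 1 - \hat\rho)^{-1} \bigr]^{n+1-k} \bigl[ \hat\Delta (z\hat 1 + \hat\rho)^{-1} \bigr]^{k+1}\, dz . \tag{3.222} $$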


Putting Eqs. (3.214) through (3.222) together, one arrives at an expression for the expansion
coefficients,

$$ b_m = \frac{2}{(m\pi)^3} \sum_{n=0}^{\infty} n! \Biggl[ b(n; m) - (-1)^m \sum_{j=0}^{n} \frac{b(j; m)}{(n-j)!} \Biggr] \sum_{k=0}^{n} (-1)^k\, \mathrm{tr}\bigl( D_{\hat\rho_0}(n; k) \bigr) , \tag{3.223} $$

where

$$ b(j; m) = (-1)^{j/2} \bigl[ 1 + (-1)^j \bigr] (m\pi)^{-j} . \tag{3.224} $$

The calculation of the terms $\mathrm{tr}\bigl( D_{\hat\rho_0}(n; k) \bigr)$ in this expansion is straightforward but tedious. Because
each $b_m$ itself can only be written in terms of an infinite series, the Fourier sine series here is dubbed
a doubly infinite series.

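A Better Way

It turns out that one integration of the operator $\mathcal L_{\hat\rho}(\hat\Delta)$ is enough to give an explicit expression for
$L(t)$ instead of the two integrations naively called for in the definition of $L''(t)$:

$$ L(t) = -\mathrm{tr}\Bigl( \hat\Delta \int^{t}\!\!\int^{t'} \mathcal L_{\hat\rho(t'')}(\hat\Delta)\, dt''\, dt' \Bigr) + c_1 t + c_2 , \tag{3.225} $$

where now we are making the integration variable in $\hat\rho(t)$ explicit. This reduction to one integration
greatly simplifies things already, and we develop it now. The point can be seen from an integration
by parts:

$$ \int^{t} \hat\rho(t')\, \mathcal L_{\hat\rho(t')}(\hat\Delta)\, dt' = \hat\rho(t) \int^{t} \mathcal L_{\hat\rho(t')}(\hat\Delta)\, dt' - \hat\Delta \int^{t}\!\!\int^{t'} \mathcal L_{\hat\rho(t'')}(\hat\Delta)\, dt''\, dt' . \tag{3.226} $$

Similarly

$$ \int^{t} \mathcal L_{\hat\rho(t')}(\hat\Delta)\, \hat\rho(t')\, dt' = \Bigl( \int^{t} \mathcal L_{\hat\rho(t')}(\hat\Delta)\, dt' \Bigr) \hat\rho(t) - \Bigl( \int^{t}\!\!\int^{t'} \mathcal L_{\hat\rho(t'')}(\hat\Delta)\, dt''\, dt' \Bigr) \hat\Delta . \tag{3.227} $$

Adding these expressions together and using the fact that

$$ \hat\rho\, \mathcal L_{\hat\rho}(\hat\Delta) + \mathcal L_{\hat\rho}(\hat\Delta)\, \hat\rho = 2\hat\Delta , \tag{3.228} $$

we get,

$$ \int^{t}\!\!\int^{t'} \tfrac{1}{2} \Bigl( \hat\Delta\, \mathcal L_{\hat\rho(t'')}(\hat\Delta) + \mathcal L_{\hat\rho(t'')}(\hat\Delta)\, \hat\Delta \Bigr) dt''\, dt' = \tfrac{1}{2} \Bigl[ \hat\rho(t) \Bigl( \int^{t} \mathcal L_{\hat\rho(t')}(\hat\Delta)\, dt' \Bigr) + \Bigl( \int^{t} \mathcal L_{\hat\rho(t')}(\hat\Delta)\, dt' \Bigr) \hat\rho(t) \Bigr] - t\hat\Delta + \hat\eta , \tag{3.229} $$

where $\hat\eta$ is a constant operator. Therefore, using the linearity and the cyclic property of the trace,
we obtain

$$ L(t) = -\mathrm{tr}\Bigl( \hat\rho \int^{t} \mathcal L_{\hat\rho(t')}(\hat\Delta)\, dt' \Bigr) + c_1 t + c_2 . \tag{3.230} $$

Kronecker Product Methods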

There is another way of looking at the equation defining $\mathcal L_{\hat\rho}(\hat\Delta)$ [Eq. (3.177)] which leads to a way
of calculating the operator

$$ \hat H = \int^{t} \mathcal L_{\hat\rho(t')}(\hat\Delta)\, dt' . \tag{3.231} $$

The resulting expression is not particularly elegant, but it does give an operational way of carrying
out this integral.

The method is to think of Eq. (3.177) quite literally as a set of simultaneous linear equations,
the solution of which defines the matrix elements of $\mathcal L_{\hat\rho}$. Once one realizes this, then there is some
hope that Eq. (3.177) may be rearranged to be of a form more amenable to standard linear algebraic
techniques. This can be accomplished by introducing the notion of vec'ing a matrix [142, 143, 132].
The operation of vec'ing a matrix is that of forming a vector by stacking all its columns on top
of each other. That is to say, if the elements of a matrix $A$ are given by $A_{ij}$, $1 \le i, j \le D$, i.e.,

$$ A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1D} \\ A_{21} & A_{22} & \cdots & A_{2D} \\ \vdots & \vdots & & \vdots \\ A_{D1} & A_{D2} & \cdots & A_{DD} \end{pmatrix} , \tag{3.232} $$

then the (column) vector associated with the vec operation on $A$ is given by

$$ \mathrm{vec}(A) = \begin{pmatrix} A_{11} \\ A_{21} \\ \vdots \\ A_{D1} \\ A_{12} \\ \vdots \\ A_{D2} \\ \vdots \\ A_{DD} \end{pmatrix} . \tag{3.233} $$

Note that the vec operation is not a simple mapping from operators to vectors because it is basis
dependent.

We shall need two simple properties of the vec operation. The first, that it is linear, is obvious.
The second is not obvious, but quite simple to derive [132, p. 255]: for any three matrices $A$, $B$,
and $X$,

$$ \mathrm{vec}(AXB) = (B^{T} \otimes A)\, \mathrm{vec}(X) , \tag{3.234} $$

where $B^{T}$ denotes the transpose of $B$ and $\otimes$ denotes the Kronecker or direct product defined by

$$ A \otimes B = \begin{pmatrix} A_{11}B & A_{12}B & \cdots & A_{1D}B \\ A_{21}B & A_{22}B & \cdots & A_{2D}B \\ \vdots & \vdots & & \vdots \\ A_{D1}B & A_{D2}B & \cdots & A_{DD}B \end{pmatrix} . \tag{3.235} $$



Choosing a particular representation for the operators in the Lyapunov equation Eq. (3.178)
and letting $I$ be the identity matrix, we see that it can be rewritten as

$$ BXI + IXC = D . \tag{3.236} $$

Using Eq. (3.234) on this, we have finally that it is equivalent to the system of linear equations
given by

$$ \bigl( I \otimes B + C^{T} \otimes I \bigr)\, \mathrm{vec}(X) = \mathrm{vec}(D) . \tag{3.237} $$

A combination of matrices like that on the left hand side of this is called the Kronecker sum of $B$ and
$C^{T}$ and is denoted by $B \oplus C^{T}$. Therefore, using this notation, when $B \oplus C^{T}$ is invertible we have

$$ \mathrm{vec}(X) = \bigl( B \oplus C^{T} \bigr)^{-1} \mathrm{vec}(D) . \tag{3.238} $$
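The vec identity and the Kronecker-sum form of the Lyapunov equation can be checked with a few lines of list arithmetic. The following is a minimal sketch using small integer matrices chosen arbitrarily for illustration; the helper names are hypothetical and no linear-algebra library is assumed.

```python
def vec(M):
    """Stack the columns of M (given as a list of rows) into a single list."""
    rows, cols = len(M), len(M[0])
    return [M[i][j] for j in range(cols) for i in range(rows)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def kron(A, B):
    """Kronecker product: block (i, j) of the result is A[i][j] * B."""
    return [[A[i][j] * B[k][l] for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(A, B):
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(A, B)]

I2 = [[1, 0], [0, 1]]
A = [[1, 2], [3, 4]]
X = [[0, 1], [5, -2]]
B = [[2, -1], [0, 3]]
C = [[1, 4], [2, 0]]

# vec(A X B) = (B^T kron A) vec(X)
lhs = vec(matmul(matmul(A, X), B))
rhs = matvec(kron(transpose(B), A), vec(X))
print(lhs, rhs)

# (I kron B + C^T kron I) vec(X) = vec(B X + X C)
kronecker_sum = add(kron(I2, B), kron(transpose(C), I2))
D = add(matmul(B, X), matmul(X, C))
print(matvec(kronecker_sum, vec(X)), vec(D))
```

Solving for `vec(X)` then reduces to inverting the `kronecker_sum` matrix with any standard linear solver.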



Let us now apply these facts and notations toward finding an explicit representation for the
operator $\hat H$ in Eq. (3.231). We start by picking a particular basis independent of $t$, in which to
express the operators $\hat\rho_0$, $\hat\rho_1$, $\hat\rho$, $\mathcal L_{\hat\rho}(\hat\Delta)$, and $\hat H$. Let us denote the matrix representations of these
by the same symbol but with the hat removed, except for $\mathcal L_{\hat\rho}(\hat\Delta)$ which we represent by a matrix
$X$. Since all these matrices are Hermitian, we have that $\rho^{T} = \rho^{*}$, etc., where the $*$ just means
complex conjugation. (As an aside, note that for any Hermitian matrix $A$, the matrix $A^{*}$ is also
Hermitian. Moreover, for any positive semidefinite matrix $B$, the matrix $B^{*}$ is positive semidefinite.
This follows because any positive semidefinite matrix $B$ can be written in the form $B = U D U^{\dagger}$
for some diagonal matrix $D = D^{*}$ and some unitary matrix $U$. Therefore $B^{*} = (U^{*}) D (U^{*})^{\dagger}$ is a
positive semidefinite matrix because $U^{*}$ is a unitary matrix.) Then, for invertible $\rho_0$ and $\rho_1$, we
have immediately that

$$ \mathrm{vec}(X) = 2 (\rho \oplus \rho^{*})^{-1}\, \mathrm{vec}(\Delta) , \tag{3.239} $$

where the matrix $\rho \oplus \rho^{*}$ is positive definite. Finding the matrix $H$ simply corresponds to unvec'ing
the vector

$$ \mathrm{vec}(H) = 2 \Bigl( \int^{t} \bigl( \rho(t') \oplus \rho(t')^{*} \bigr)^{-1} dt' \Bigr) \mathrm{vec}(\Delta) . \tag{3.240} $$



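Our problem thus reduces to evaluating the operator integral in Eq. (3.240). For this purpose,
we rearrange the integrand as follows:

$$ \bigl( \rho(t') \oplus \rho(t')^{*} \bigr)^{-1} = \bigl( \boldsymbol{\rho}_0 + t' \boldsymbol{\Delta} \bigr)^{-1} = \boldsymbol{\rho}_0^{-1/2} \Bigl( I + t' \bigl( \boldsymbol{\rho}_0^{-1/2} \boldsymbol{\Delta}\, \boldsymbol{\rho}_0^{-1/2} \bigr) \Bigr)^{-1} \boldsymbol{\rho}_0^{-1/2} , \tag{3.241} $$

where

$$ \boldsymbol{\rho}_0 = \rho_0 \oplus \rho_0^{*} \quad \mbox{and} \quad \boldsymbol{\Delta} = \Delta \oplus \Delta^{*} , \tag{3.242} $$

and $I$ is the appropriate sized identity matrix. If we suppose that $\boldsymbol{\Delta}$ is invertible, Eq. (3.241) can
be integrated immediately. This is because the matrices $I$ and $\boldsymbol{\rho}_0^{-1/2} \boldsymbol{\Delta}\, \boldsymbol{\rho}_0^{-1/2}$ commute:

$$ \int^{t} \Bigl( I + t' \bigl( \boldsymbol{\rho}_0^{-1/2} \boldsymbol{\Delta}\, \boldsymbol{\rho}_0^{-1/2} \bigr) \Bigr)^{-1} dt' = \bigl( \boldsymbol{\rho}_0^{-1/2} \boldsymbol{\Delta}\, \boldsymbol{\rho}_0^{-1/2} \bigr)^{-1} \ln\Bigl( I + t \bigl( \boldsymbol{\rho}_0^{-1/2} \boldsymbol{\Delta}\, \boldsymbol{\rho}_0^{-1/2} \bigr) \Bigr) , \tag{3.243} $$

and so

$$ \mathrm{vec}(H) = 2\, \boldsymbol{\rho}_0^{-1/2} \bigl( \boldsymbol{\rho}_0^{-1/2} \boldsymbol{\Delta}\, \boldsymbol{\rho}_0^{-1/2} \bigr)^{-1} \ln\Bigl( I + t \bigl( \boldsymbol{\rho}_0^{-1/2} \boldsymbol{\Delta}\, \boldsymbol{\rho}_0^{-1/2} \bigr) \Bigr) \boldsymbol{\rho}_0^{-1/2}\, \mathrm{vec}(\Delta) . \tag{3.244} $$

The logarithm in this is well defined because the operators in its argument are invertible [132, p.
474].

Eq. (3.244) contains the algorithm for calculating $H$ in the given basis. This then may be used
in conjunction with Eq. (3.230) to calculate $L(t)$. An interesting open question is whether there
exists a basis in which to represent the operators $\hat\rho_0$, $\hat\rho_1$, etc., so that the matrix appearing in the
logarithm above is diagonal. If so, this would greatly simplify the computation of $L(t)$.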
3.5.4 The Two-Dimensional Case

In this Subsection we consider the important special case of binary communication channels on
two-dimensional Hilbert spaces. Here the new bounds are readily expressible in terms of elementary
functions and, moreover, the optimal orthogonal projection-valued measurement can be found via
a variational calculation (just as was possible with the statistical overlap). With this case, one can
gain a feel for how tightly the new bounds delimit the true accessible information $I(t)$.

Let the signal states $\hat\rho_0$ and $\hat\rho_1$ again be represented by two vectors within the Bloch sphere,
i.e.,

$$ \hat\rho_0 = \tfrac{1}{2} \bigl( \hat 1 + \vec a \cdot \vec\sigma \bigr) \quad \mbox{and} \quad \hat\rho_1 = \tfrac{1}{2} \bigl( \hat 1 + \vec b \cdot \vec\sigma \bigr) . \tag{3.245} $$

The total density matrix for the channel can then be written as

$$ \hat\rho = \tfrac{1}{2} \bigl( \hat 1 + \vec c \cdot \vec\sigma \bigr) , \tag{3.246} $$

where

$$ \vec c = (1-t)\, \vec a + t\, \vec b = \vec a + t\, \vec d = \vec b - (1-t)\, \vec d \tag{3.247} $$

and

$$ \vec d = \vec b - \vec a . \tag{3.248} $$

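For an orthogonal projection-valued measurement specified by the Bloch vector $\vec n / n$, the mutual
information takes the form

$$ J(t; \vec n) = (1-t)\, K(\hat\rho_0/\hat\rho; \vec n) + t\, K(\hat\rho_1/\hat\rho; \vec n) , \tag{3.249} $$

where

$$ K(\hat\rho_0/\hat\rho; \vec n) = \frac{1}{2n} \left[ \bigl( n + \vec a \cdot \vec n \bigr) \ln\!\left( \frac{n + \vec a \cdot \vec n}{n + \vec c \cdot \vec n} \right) + \bigl( n - \vec a \cdot \vec n \bigr) \ln\!\left( \frac{n - \vec a \cdot \vec n}{n - \vec c \cdot \vec n} \right) \right] \tag{3.250} $$

and

$$ K(\hat\rho_1/\hat\rho; \vec n) = \frac{1}{2n} \left[ \bigl( n + \vec b \cdot \vec n \bigr) \ln\!\left( \frac{n + \vec b \cdot \vec n}{n + \vec c \cdot \vec n} \right) + \bigl( n - \vec b \cdot \vec n \bigr) \ln\!\left( \frac{n - \vec b \cdot \vec n}{n - \vec c \cdot \vec n} \right) \right] . \tag{3.251} $$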
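These Bloch-vector expressions translate directly into code. The sketch below is an added illustration with hypothetical example vectors: it evaluates $J(t; \vec n)$ for an unnormalized measurement vector and scans directions in the plane of the two signal vectors to locate a good measurement numerically.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def xlog(p, q):
    """p ln(p / q), with the convention 0 ln 0 = 0."""
    return p * math.log(p / q) if p > 0 else 0.0

def K(s, c, n):
    """Relative information between the outcome statistics of the states with
    Bloch vectors s and c, for the projective measurement along n."""
    nn = math.sqrt(dot(n, n))
    return (xlog(nn + dot(s, n), nn + dot(c, n))
            + xlog(nn - dot(s, n), nn - dot(c, n))) / (2 * nn)

def J(t, a, b, n):
    """Mutual information J(t; n) = (1 - t) K(rho0/rho; n) + t K(rho1/rho; n)."""
    c = [(1 - t) * x + t * y for x, y in zip(a, b)]
    return (1 - t) * K(a, c, n) + t * K(b, c, n)

a, b, t = (0.6, 0.0, 0.0), (0.0, 0.8, 0.0), 0.5
best_angle, best_value = max(
    ((theta, J(t, a, b, (math.cos(theta), math.sin(theta), 0.0)))
     for theta in (0.001 * k for k in range(3142))),
    key=lambda pair: pair[1])
print(best_angle, best_value)
```

Only the direction of the measurement vector matters, so rescaling `n` leaves the result unchanged.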



The optimal projector is found by varying expression (3.249) over all vectors $\vec n$. The resulting
equation for the optimal $\vec n$ is

$$ 0 = (1-t) \ln\!\left[ \frac{(n + \vec c \cdot \vec n)(n - \vec a \cdot \vec n)}{(n - \vec c \cdot \vec n)(n + \vec a \cdot \vec n)} \right] \vec a_\perp + t \ln\!\left[ \frac{(n + \vec c \cdot \vec n)(n - \vec b \cdot \vec n)}{(n - \vec c \cdot \vec n)(n + \vec b \cdot \vec n)} \right] \vec b_\perp , \tag{3.252} $$

where

$$ \vec a_\perp = \vec a - \left( \vec a \cdot \frac{\vec n}{n} \right) \frac{\vec n}{n} \tag{3.253} $$

and

$$ \vec b_\perp = \vec b - \left( \vec b \cdot \frac{\vec n}{n} \right) \frac{\vec n}{n} \tag{3.254} $$

are vectors perpendicular to $\vec n$. Equation (3.252) is unfortunately a transcendental equation and as
such generally has no explicit solution. That is really no problem, however, for given any particular
$\vec a$, $\vec b$, and $t$, a numerical solution for $\vec n$ can easily be computed. Nevertheless, it is of some interest
to classify the cases in which one can actually write out an explicit solution to Eq. (3.252). There
are four nontrivial situations where this is possible:

1. a classical channel, where $\hat\rho_0$ and $\hat\rho_1$ commute (i.e., $\vec a$ and $\vec b$ are parallel),

2. $\hat\rho_0$ and $\hat\rho_1$ are both pure states (i.e., $a = b = 1$),

3. $a = b$ and $t = \tfrac{1}{2}$, and

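4. $t$ is explicitly determined by $\hat\rho_0$ and $\hat\rho_1$ according to

$$ t = \frac{1}{2} + \frac{1}{2} \left[ \frac{\sqrt{1 - a^2} - \sqrt{1 - b^2}}{\sqrt{1 - a^2} + \sqrt{1 - b^2}} \right] . \tag{3.255} $$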



In case (1), taking $\vec n / n$ parallel to $\vec a$ and $\vec b$ causes $\vec a_\perp$ and $\vec b_\perp$ to vanish. When conditions (2)-(4)
are fulfilled, Eq. (3.252) can be solved by requiring that the arguments of the logarithms be
multiplicative inverses, i.e.,

$$ \frac{(n + \vec c \cdot \vec n)(n - \vec a \cdot \vec n)}{(n - \vec c \cdot \vec n)(n + \vec a \cdot \vec n)} = \frac{(n - \vec c \cdot \vec n)(n + \vec b \cdot \vec n)}{(n + \vec c \cdot \vec n)(n - \vec b \cdot \vec n)} , \tag{3.256} $$

and choosing $\vec n$ such that

$$ (1-t)\, \vec a_\perp = t\, \vec b_\perp . \tag{3.257} $$

Cases (2) and (3), reported previously by Levitin [109], are limits of case (4).

The condition of Eq. (3.257) in the exactly solvable cases (2)-(4) is equivalent to the requirement
that

$$ \vec n = t\, \vec b - (1-t)\, \vec a . \tag{3.258} $$

This is of significance because, for arbitrary $t$, the measurement that minimizes the probability
of error measure of distinguishability studied in Section 3.2 is just the measurement specified by
Eq. (3.258) [76, 27]. Thus, in the cases where the optimal information gathering measurement can
be written explicitly, it coincides with the optimal error probability measurement. Now, simply
rewriting $t$ and $1 - t$ in terms of the vectors $\vec a$, $\vec b$, and $\vec c$, the vector (3.258) becomes,

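$$ \vec n = \frac{(\vec a - \vec c) \cdot (\vec a - \vec b)}{|\vec a - \vec b|^2}\, \vec b - \frac{(\vec b - \vec c) \cdot (\vec b - \vec a)}{|\vec b - \vec a|^2}\, \vec a . \tag{3.259} $$

In cases (1)-(3) this can be seen to reduce to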



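$$ \vec n_0 = \bigl( 1 - \vec a \cdot \vec c \bigr)\, \vec b - \bigl( 1 - \vec b \cdot \vec c \bigr)\, \vec a = \vec d + \vec c \times \bigl( \vec c \times \vec d \bigr) . \tag{3.260} $$

(It should be kept in mind that in going from Eqs. (3.258) to (3.259) to (3.260) the length of $\vec n$ has
been allowed to vary.)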
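The two forms of the vector $\vec n_0 = (1 - \vec a \cdot \vec c)\,\vec b - (1 - \vec b \cdot \vec c)\,\vec a = \vec d + \vec c \times (\vec c \times \vec d)$ can be compared numerically. The sketch below is an added check with hypothetical Bloch vectors; it also confirms that for two pure states $\vec n_0$ is parallel to $t\,\vec b - (1-t)\,\vec a$, so that only the length has changed.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def mix(t, a, b):
    """Mean Bloch vector c = (1 - t) a + t b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def n0_from_a_b(t, a, b):
    """(1 - a.c) b - (1 - b.c) a"""
    c = mix(t, a, b)
    return [(1 - dot(a, c)) * y - (1 - dot(b, c)) * x for x, y in zip(a, b)]

def n0_from_c_d(t, a, b):
    """d + c x (c x d), with d = b - a"""
    c = mix(t, a, b)
    d = [y - x for x, y in zip(a, b)]
    return [p + q for p, q in zip(d, cross(c, cross(c, d)))]

# Mixed states: the two forms agree.
a, b, t = [0.6, 0.0, 0.0], [0.1, 0.7, 0.2], 0.35
print(n0_from_a_b(t, a, b), n0_from_c_d(t, a, b))

# Pure states: n0 is parallel to t b - (1 - t) a, so their cross product vanishes.
pa, pb = [0.0, 0.0, 1.0], [math.sin(1.0), 0.0, math.cos(1.0)]
n = [t * y - (1 - t) * x for x, y in zip(pa, pb)]
print(cross(n0_from_a_b(t, pa, pb), n))
```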

Case (2), and consequently the orthogonal projection-valued measurement set by Eq. (3.260), is of particular interest in communication theory. This is because two pure states in any Hilbert space span only a two-dimensional subspace of that Hilbert space. Hence Eq. (3.260) remains valid as the optimal orthogonal projection-valued measurement for a pure-state binary channel in a Hilbert space of any dimension. Moreover, Levitin [110] has shown that in this case the optimal orthogonal projection-valued measurement is indeed the actual optimal measurement.

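Let us write out this case in more standard notation. Suppose $\rho_0 = |\psi_0\rangle\langle\psi_0|$ and $\rho_1 = |\psi_1\rangle\langle\psi_1|$, and let

$$ q = |\langle\psi_0|\psi_1\rangle|^2 = {\rm tr}(\rho_0\rho_1) = \tfrac{1}{2}\,(1+\vec a\cdot\vec b) . \tag{3.261} $$

Then, because a = b = 1 here, the optimal measurement given by Eq. (3.258) has norm

$$ n = \sqrt{1-4t(1-t)q} . \tag{3.262} $$

Using this and some algebra, the expression for the accessible information I(t) reduces to

$$ I(t) = \frac{1}{2n}\left\{ (1-t)\left[ (n+1-2tq)\ln\!\left(\frac{1+n}{2(1-t)}\right) + (n-1+2tq)\ln\!\left(\frac{1-n}{2(1-t)}\right) \right] + t\left[ (n+1-2(1-t)q)\ln\!\left(\frac{1+n}{2t}\right) + (n-1+2(1-t)q)\ln\!\left(\frac{1-n}{2t}\right) \right] \right\} . \tag{3.263} $$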



This confirms Levitin's expression in Ref. |10£] though this version is significantly more compact. 
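
The closed form for two pure states can be cross-checked against a direct evaluation of the mutual information. The sketch below is only an illustration under stated assumptions: it takes the pure-state formula with overlap q and prior t as written above, uses unit Bloch vectors in a plane, and measures along $\vec n = t\vec b - (1-t)\vec a$. All function names are hypothetical, and the degenerate point t = 1/2 with q = 1 (where n = 0) is not handled.

```python
import math

def accessible_info_pure(t, q):
    """Closed-form accessible information (nats) for two pure states with overlap q and prior t."""
    n = math.sqrt(1.0 - 4.0 * t * (1.0 - t) * q)
    s = 1.0 - t

    def term(coef, arg):
        # coef * ln(arg), using the convention 0 * ln(0) = 0
        return 0.0 if coef <= 0.0 else coef * math.log(arg)

    part0 = term(n + 1 - 2 * t * q, (1 + n) / (2 * s)) + term(n - 1 + 2 * t * q, (1 - n) / (2 * s))
    part1 = term(n + 1 - 2 * s * q, (1 + n) / (2 * t)) + term(n - 1 + 2 * s * q, (1 - n) / (2 * t))
    return (s * part0 + t * part1) / (2.0 * n)

def mutual_info_direct(t, angle):
    """Mutual information of the projective measurement along n = t b - (1 - t) a."""
    a = (1.0, 0.0)
    b = (math.cos(angle), math.sin(angle))
    s = 1.0 - t
    n_vec = (t * b[0] - s * a[0], t * b[1] - s * a[1])
    n = math.hypot(n_vec[0], n_vec[1])
    c = (s * a[0] + t * b[0], s * a[1] + t * b[1])

    def probs(v):
        x = (v[0] * n_vec[0] + v[1] * n_vec[1]) / n
        return (0.5 * (1 + x), 0.5 * (1 - x))

    def kl(p, r):
        return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0.0)

    return s * kl(probs(a), probs(c)) + t * kl(probs(b), probs(c))

t, angle = 0.3, math.pi / 3
q = 0.5 * (1.0 + math.cos(angle))
closed_form = accessible_info_pure(t, q)
direct = mutual_info_direct(t, angle)
print(closed_form, direct)
```

For orthogonal states (q = 0) the same function returns the binary entropy of the prior, as it should when the two states are perfectly distinguishable.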

With this much known about the exact solutions, let us return to the question of how well the bounds M(t), L(t), etc., fare in comparison. For a measurement specified by the Bloch vector $\vec n/n$, the second derivative of the mutual information takes the form

$$ J''(t) = -\frac{(\vec d\cdot\vec n)^2}{n^2-(\vec c\cdot\vec n)^2} . \tag{3.264} $$

The vector $\vec n$ that minimizes this is again given easily enough by a variational calculation; the equation specifying the nontrivial solution is

$$ \big(n^2-(\vec c\cdot\vec n)^2\big)\,\vec d - (\vec d\cdot\vec n)\big(\vec n-(\vec c\cdot\vec n)\,\vec c\big) = 0 . \tag{3.265} $$

After a bit of algebra one finds its solution to be given by none other than

$$ \vec n_0 = (1-\vec a\cdot\vec c)\,\vec b - (1-\vec b\cdot\vec c)\,\vec a , \tag{3.266} $$

the vector given by Eq. (3.260); this time, however, the expression is valid for all $\vec a$, $\vec b$, and t. Inserting this vector into expressions (3.250) and (3.251) produces the lower bound M(t),

$$ M(t) = (1-t)\,K(\rho_0/\rho;\vec n_0) + t\,K(\rho_1/\rho;\vec n_0) . \tag{3.267} $$






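The upper bound L(t) is found by integrating J''(t) back up but with this particular measurement in place; that is to say, by integrating

$$ L''(t) = -\left( d^2 + \frac{(\vec c\cdot\vec d)^2}{1-c^2} \right) . \tag{3.268} $$

The result, upon requiring the boundary conditions L(0) = L(1) = 0, is

$$ L(t) = \frac{\delta}{2d^2}\Big[ -(\delta-\vec c\cdot\vec d)\ln(\delta-\vec c\cdot\vec d) - (\delta+\vec c\cdot\vec d)\ln(\delta+\vec c\cdot\vec d) + \beta_1 t + \beta_2 \Big] , \tag{3.269} $$

where

$$ \beta_2 = (\delta-\vec a\cdot\vec d)\ln(\delta-\vec a\cdot\vec d) + (\delta+\vec a\cdot\vec d)\ln(\delta+\vec a\cdot\vec d) , \tag{3.270} $$

and

$$ \beta_1 = (\delta-\vec b\cdot\vec d)\ln(\delta-\vec b\cdot\vec d) + (\delta+\vec b\cdot\vec d)\ln(\delta+\vec b\cdot\vec d) - \beta_2 , \tag{3.271} $$

and

$$ \delta = \sqrt{(1-\vec a\cdot\vec b)^2-(1-a^2)(1-b^2)} = \sqrt{d^2-|\vec c\times\vec d|^2} = \sqrt{d^2-|\vec a\times\vec b|^2} . \tag{3.272} $$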



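In contrast, the Jozsa-Robb-Wootters lower bound and the Holevo upper bound are given by Eqs. (3.151) and (3.149), respectively, where in Bloch vector representation

$$ Q(\rho) = \frac{1}{4c}\left[ (1-c)^2\ln\!\left(\frac{1-c}{2}\right) - (1+c)^2\ln\!\left(\frac{1+c}{2}\right) \right] \tag{3.273} $$

and

$$ S(\rho) = -\frac{1}{2}\left[ (1-c)\ln\!\left(\frac{1-c}{2}\right) + (1+c)\ln\!\left(\frac{1+c}{2}\right) \right] , \tag{3.274} $$

and similarly for $Q(\rho_0)$, $S(\rho_0)$, etc. The extent to which the bounds M(t) and L(t) are tighter than the bounds Q(t) and S(t), and the degree to which they conform to the exact numerical answer I(t), is illustrated by a typical example in Fig. 3.2.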



3.5.5 Other Bounds 

Another Bound from the Schwarz Inequality 

Because of the difficulties in constructing a concise expression for L(t), it is worthwhile to look at another upper bound to I(t) derived from the Schwarz inequality. This is a bound derived from the second usage of lowering operators $\mathcal{L}_\rho$, i.e., that in Eq. (3.176), but with $\mathcal{L}_{\rho^{1/2}}$ in particular. We can immediately write

$$ J''(t) \ge -{\rm tr}\Big( \big(\mathcal{L}_{\rho^{1/2}}(\Delta)\big)^2 \Big) \equiv N''(t) , \tag{3.275} $$

since $\mathcal{L}_{\rho^{1/2}}(\Delta)$ is Hermitian and guaranteed to exist for the same reason $\mathcal{L}_\rho(\Delta)$ does. The right hand side of this, when integrated twice and required to meet the boundary conditions

$$ N(0) = N(1) = 0 , \tag{3.276} $$

gives a new upper bound N(t).

Figure 3.2: The Holevo upper bound S(t), the upper bound L(t), the information I(t) extractable by optimal orthogonal projection-valued measurement (found numerically), the lower bound M(t), and the Jozsa-Robb-Wootters lower bound Q(t), all for the case that $\rho_0$ is pure (a = 1), $\rho_1$ is mixed with b = 2/3, and the angle between the two Bloch vectors is π/3. (Plot of information in nats versus prior probability t.)

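Since the bound derived from Eq. (3.275) is necessarily worse than L(t), this would be of little interest if it were not for the following fact. Consider any positive operator X that depends on a parameter t. Then

$$ \frac{dX}{dt} = \sqrt{X}\,\frac{d\sqrt{X}}{dt} + \frac{d\sqrt{X}}{dt}\,\sqrt{X} . \tag{3.277} $$

Thus,

$$ \frac{d\sqrt{X}}{dt} = \frac{1}{2}\,\mathcal{L}_{\sqrt{X}}\!\left(\frac{dX}{dt}\right) , \tag{3.278} $$

and specifically,

$$ N''(t) = -4\,{\rm tr}\!\left[ \left(\frac{d\sqrt{\rho}}{dt}\right)^{2} \right] . \tag{3.279} $$

The simplicity of this expression would apparently give hope that there may be some way of integrating it explicitly.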

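The reason it would be useful to find N(t) is that, though it cannot be as tight as the L(t) bound, it still beats the Holevo bound S(t). This can be seen as follows. In a basis $|j\rangle$ diagonalizing $\rho$,

$$ N''(t) = -\sum_{\{j,k\,|\,\lambda_j+\lambda_k\neq 0\}} \left( \frac{2}{\sqrt{\lambda_j}+\sqrt{\lambda_k}} \right)^{2} |\Delta_{jk}|^2 . \tag{3.280} $$

Now, as in the proof of the Holevo bound, $S(t) \ge N(t)$ requires that

$$ \Phi(x,y) \ge \left( \frac{2}{\sqrt{x}+\sqrt{y}} \right)^{2} . \tag{3.281} $$

This can be shown very simply. The case for $\Phi(x,x)$ is again automatic. Suppose 0 < x, y < 1 and x ≠ y. We already know that $\Phi(x,y) \ge 2/(x+y)$. Consequently,

$$ \Phi(\sqrt{x},\sqrt{y}) \ge \frac{2}{\sqrt{x}+\sqrt{y}} . \tag{3.282} $$

But that is to say,

$$ \frac{\tfrac{1}{2}(\ln x-\ln y)}{\sqrt{x}-\sqrt{y}} \ge \frac{2}{\sqrt{x}+\sqrt{y}} . \tag{3.283} $$

Multiplying both sides of this by $2/(\sqrt{x}+\sqrt{y})$ gives the desired result.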
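
The inequality just proved is easy to probe numerically. The following sketch assumes that $\Phi(x,y) = (\ln x - \ln y)/(x-y)$, with the limiting value 1/x on the diagonal; this definition is inferred from the step above rather than quoted from an earlier section, and the function names are illustrative. It scans a grid of eigenvalue pairs and records the smallest gap between the two sides.

```python
import math

def phi(x, y):
    """Assumed kernel Phi(x, y) = (ln x - ln y) / (x - y); equals 1/x when x == y."""
    if x == y:
        return 1.0 / x
    return (math.log(x) - math.log(y)) / (x - y)

def n_kernel(x, y):
    """Kernel (2 / (sqrt(x) + sqrt(y)))^2 that governs N''(t)."""
    return (2.0 / (math.sqrt(x) + math.sqrt(y))) ** 2

grid = [k / 20.0 for k in range(1, 21)]   # 0.05, 0.10, ..., 1.00

# Smallest value of Phi - kernel over the grid; it should never be negative.
worst_gap = min(phi(x, y) - n_kernel(x, y) for x in grid for y in grid)

# The N-kernel itself dominates the weaker bound 2 / (x + y).
worst_gap_weak = min(n_kernel(x, y) - 2.0 / (x + y) for x in grid for y in grid)
print(worst_gap, worst_gap_weak)
```

Both gaps vanish only on the diagonal x = y, up to rounding.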

Unfortunately, despite the seeming simplicity of Eq. (3.279), little progress has been made toward its general integration. This in itself may be an indication that the simplicity is only apparent. This can be seen with the example of 2 × 2 density operators. Let us report enough of these results to make this point convincing.

The operator $\sqrt{\rho}$ can be represented with the help of the unit operator and the Pauli matrices as

$$ \sqrt{\rho} = r_0\,\mathbb{1} + \vec r\cdot\vec\sigma , \tag{3.284} $$

where the requirement that this squares to give the representation of $\rho$ in Eq. (3.246) forces

$$ r_0^2 + r^2 = \frac{1}{2} \tag{3.285} $$

and

$$ 2 r_0\,\vec r = \frac{1}{2}\,\vec c . \tag{3.286} $$

These two requirements go together to give a quartic equation specifying $r_0$. Since $\sqrt{\rho}$ must be a positive operator, its eigenvalues $r_0 + r$ and $r_0 - r$ must both be positive. Hence, picking the largest value of $r_0$ consistent with the quartic equation, we find

$$ r_0 = \frac{1}{2}\left( 1+\sqrt{1-c^2} \right)^{1/2} \tag{3.287} $$

and

$$ \vec r = \frac{1}{4r_0}\,\vec c . \tag{3.288} $$

Taking the derivative of these quantities with respect to t, we find that

$$ N''(t) = -8\left( r_0'^{\,2} + \vec r\,'\cdot\vec r\,' \right) , \tag{3.289} $$

which reduces after quite some algebra to

$$ N''(t) = -\frac{1}{2r_0^2}\left[ d^2 + \frac{(\vec c\cdot\vec d)^2}{1-c^2}\left( 1-\frac{1}{8r_0^2} \right) \right] , \tag{3.290} $$

and really cannot be reduced any further. Clearly this is a considerable mess as a function of t, and the actual expression for N(t) is far worse.
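
A short numerical sketch makes the 2 × 2 statements concrete. It assumes $\rho = \tfrac12(\mathbb{1} + \vec c\cdot\vec\sigma)$ with $\vec c = (1-t)\vec a + t\vec b$ and $\vec d = \vec b - \vec a$, builds the Bloch form of $\sqrt\rho$, and evaluates $N''(t) = -8(r_0'^2 + \vec r\,'\cdot\vec r\,')$ by central finite differences. The helper names, the step size h, and the sample states are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mix(a, b, t):
    """Bloch vector c = (1 - t) a + t b of the average state."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def sqrt_rho_bloch(c_vec):
    """(r0, r_vec) such that sqrt(rho) = r0*1 + r.sigma for rho = (1 + c.sigma)/2."""
    r0 = 0.5 * math.sqrt(1.0 + math.sqrt(1.0 - dot(c_vec, c_vec)))
    return r0, [x / (4.0 * r0) for x in c_vec]

def n_second_derivative(a, b, t, h=1e-5):
    """N''(t) = -8 (r0'^2 + r'.r'), with derivatives taken by central differences."""
    r0_p, r_p = sqrt_rho_bloch(mix(a, b, t + h))
    r0_m, r_m = sqrt_rho_bloch(mix(a, b, t - h))
    dr0 = (r0_p - r0_m) / (2.0 * h)
    dr = [(p - m) / (2.0 * h) for p, m in zip(r_p, r_m)]
    return -8.0 * (dr0 ** 2 + dot(dr, dr))

def l_second_derivative(a, b, t):
    """L''(t) = -(d^2 + (c.d)^2 / (1 - c^2)), for comparison."""
    c = mix(a, b, t)
    d = [y - x for x, y in zip(a, b)]
    return -(dot(d, d) + dot(c, d) ** 2 / (1.0 - dot(c, c)))

a = [0.9, 0.0, 0.0]
b = [2.0 / 3.0 * math.cos(math.pi / 3), 2.0 / 3.0 * math.sin(math.pi / 3), 0.0]
t = 0.4
c = mix(a, b, t)
r0, r = sqrt_rho_bloch(c)
n_pp = n_second_derivative(a, b, t)
l_pp = l_second_derivative(a, b, t)
print(r0, r, n_pp, l_pp)
```

In this example N''(t) lies below L''(t), consistent with N(t) being the weaker of the two upper bounds.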

A Bound Based on Jensen's Inequality 

Still another upper bound to the accessible information can be built from the technique developed 
to find the function L"{t). This bound makes crucial use of the concavity of the logarithm function. 

For simplicity, let us suppose that both $\rho_0$ and $\rho_1$ are invertible. Recall the representation, given by Eq. (3.144), of the mutual information as the average of two Kullback-Leibler relative informations. For each of these terms, we have two different bounds that come from Jensen's inequality, i.e., for any probability distribution q(b),

$$ \sum_b q(b)\ln x_b \le \ln\!\Big( \sum_b q(b)\,x_b \Big) . \tag{3.291} $$

The first bound is that

$$ K(p_i/p) = \sum_b p_i(b)\ln\!\left( \frac{p_i(b)}{p(b)} \right) \le \ln\!\left( \sum_b \frac{p_i(b)^2}{p(b)} \right) . \tag{3.292} $$

The second is that

$$ K(p_i/p) = -2\sum_b p_i(b)\ln\!\left( \frac{p(b)}{p_i(b)} \right)^{1/2} \ge -2\ln\!\Big( \sum_b p_i(b)^{1/2}\,p(b)^{1/2} \Big) . \tag{3.293} $$



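Therefore, it follows that the quantum Kullback information $K(\rho_i/\rho)$ is bounded by

$$ -2\ln\!\Big( \min_{\{E_b\}} \sum_b \sqrt{{\rm tr}(\rho_i E_b)}\,\sqrt{{\rm tr}(\rho E_b)} \Big) \le K(\rho_i/\rho) \le \ln\!\left( \max_{\{E_b\}} \sum_b \frac{\big({\rm tr}(\rho_i E_b)\big)^2}{{\rm tr}(\rho E_b)} \right) . \tag{3.294} $$

These bounds can be evaluated explicitly by the techniques of earlier sections (in particular Section 3.5.1). Namely, we have

$$ -2\ln\!\Big( {\rm tr}\sqrt{\rho^{1/2}\rho_i\,\rho^{1/2}} \Big) \le K(\rho_i/\rho) \le \ln\!\Big( {\rm tr}\big(\rho_i\,\mathcal{L}_\rho(\rho_i)\big) \Big) . \tag{3.295} $$

(The upper bound is assured to exist in the form stated because $\mathcal{L}_\rho(\rho_i)$ is itself well defined; this follows because the $\rho_i$ are assumed invertible.)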

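One of these bounds may be used to upper bound the accessible information. In particular, since the maximum of a sum is less than or equal to the sum of the maxima, it follows that

$$ I(t) \le (1-t)\ln\!\Big( {\rm tr}\big(\rho_0\,\mathcal{L}_\rho(\rho_0)\big) \Big) + t\ln\!\Big( {\rm tr}\big(\rho_1\,\mathcal{L}_\rho(\rho_1)\big) \Big) \equiv R(t) . \tag{3.296} $$

To see how the bound R(t) looks for 2-dimensional density operators, note that in this case $\mathcal{L}_\rho(\rho_0)$ can be written in terms of Bloch vectors as [144]

$$ \mathcal{L}_\rho(\rho_0) = S_0\,\mathbb{1} + \vec R_0\cdot\vec\sigma , \tag{3.297} $$

where

$$ S_0 = \frac{1-\vec a\cdot\vec c}{1-c^2} \tag{3.298} $$

and

$$ \vec R_0 = \vec a - S_0\,\vec c . \tag{3.299} $$

A similar representation holds for $\mathcal{L}_\rho(\rho_1)$; one need only substitute $\vec b$ for $\vec a$ above. Then, we get

$$ {\rm tr}\big(\rho_0\,\mathcal{L}_\rho(\rho_0)\big) = S_0 + \vec a\cdot\vec R_0 = \frac{1}{1-c^2}\,(1-\vec a\cdot\vec c)^2 + a^2 , \tag{3.300} $$

and similarly

$$ {\rm tr}\big(\rho_1\,\mathcal{L}_\rho(\rho_1)\big) = \frac{1}{1-c^2}\,(1-\vec b\cdot\vec c)^2 + b^2 . \tag{3.301} $$

Substituting these into Eq. (3.296) gives the desired result.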

A Bound Based on Purifications

Imagine that states $\rho_0$ and $\rho_1$ come from a partial trace over some subsystem of a larger Hilbert space prepared in either a pure state $|\psi_0\rangle$ or $|\psi_1\rangle$. That is to say, let $|\psi_0\rangle$ and $|\psi_1\rangle$ be purifications of $\rho_0$ and $\rho_1$, respectively. Then any measurement POVM $\{E_b\}$ on the original Hilbert space can be thought of as a measurement POVM $\{E_b\otimes\mathbb{1}\}$ on the larger Hilbert space that ignores the extra subsystem. In particular, we will have that the measurement outcome statistics can be rewritten as

$$ {\rm tr}(\rho_s E_b) = {\rm tr}\big( |\psi_s\rangle\langle\psi_s|\,(E_b\otimes\mathbb{1}) \big) \tag{3.302} $$

for s = 0, 1. (The trace on the left side of this equation is taken over only the original Hilbert space; the trace on the right side is taken over the larger Hilbert space in which the purifications live.) Similarly, we have for the average density operator $\rho$ that

$$ {\rm tr}(\rho E_b) = {\rm tr}\big( \rho^{AB}\,(E_b\otimes\mathbb{1}) \big) , \tag{3.303} $$

where

$$ \rho^{AB} = (1-t)\,|\psi_0\rangle\langle\psi_0| + t\,|\psi_1\rangle\langle\psi_1| \tag{3.304} $$

is the average density operator of the purifications.

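Let us now denote the mutual information for the ensemble consisting of $\rho_0$ and $\rho_1$ with respect to the measurement $\{E_b\}$ by

$$ J\big(\rho_0,\rho_1;t;\{E_b\}\big) . \tag{3.305} $$

It follows then that

$$ I(\rho_0|\rho_1) = \max_{\{E_b\}} J\big(\rho_0,\rho_1;t;\{E_b\}\big) = \max_{\{E_b\}} J\big(|\psi_0\rangle,|\psi_1\rangle;t;\{E_b\otimes\mathbb{1}\}\big) \le \max_{\{\tilde E_c\}} J\big(|\psi_0\rangle,|\psi_1\rangle;t;\{\tilde E_c\}\big) = I\big(|\psi_0\rangle\,\big|\,|\psi_1\rangle\big) , \tag{3.306} $$

where the last maximization runs over all POVMs $\{\tilde E_c\}$ on the larger Hilbert space. That is to say, the accessible information of the original ensemble of states must be less than or equal to the accessible information of the ensemble of purifications.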

This is of great interest because we already know how to calculate the accessible information for two pure states. It is given by Eq. (3.263) with

$$ q = |\langle\psi_0|\psi_1\rangle|^2 . \tag{3.307} $$

This observation immediately gives an infinite number of new upper bounds to the accessible information, one for each possible set of purifications. The one of most interest, of course, is the smallest upper bound in this class.

Clearly then, the larger the overlap between the purifications $|\psi_0\rangle$ and $|\psi_1\rangle$, the tighter the bound will be. For the larger the overlap, the less the distinguishability that will have been added to the purifications above and beyond that of the original states. We need only recall from Section 3.2 that the largest possible overlap between purifications is given by

$$ q = |\langle\psi_0|\psi_1\rangle|^2 = \left( {\rm tr}\sqrt{\rho_1^{1/2}\rho_0\,\rho_1^{1/2}} \right)^{2} . \tag{3.308} $$

Use of this q in Eq. (3.263) gives the best upper bound based on purifications. This bound we shall denote by P(t). When $\rho_0$ and $\rho_1$ are both very close to being pure states, this bound can be very tight.



3.5.6 Photo Gallery 

In the following pages, several plots compare the bounds on accessible information derived in the previous sections. The plots were generated by Mathematica with the following code.

(* Initial Notation *)
av = aa {1, 0, 0}
bv = bb {Cos[theta], Sin[theta], 0}
dv = bv - av
cv = av + t*dv
a = Sqrt[av.av]
b = Sqrt[bv.bv]
c = Sqrt[cv.cv]
d = Sqrt[dv.dv]

(* Holevo Upper Bound *)
SC = -( (1-c)*Log[(1-c)/2] + (1+c)*Log[(1+c)/2] )/2
SB = -( (1-b)*Log[(1-b)/2] + (1+b)*Log[(1+b)/2] )/2
SA = -( (1-a)*Log[(1-a)/2] + (1+a)*Log[(1+a)/2] )/2
SS = SC - (1-t)*SA - t*SB

(* Jozsa-Robb-Wootters Lower Bound *)
QC = ( ((1-c)^2)*Log[(1-c)/2] - ((1+c)^2)*Log[(1+c)/2] )/(4*c)
QB = ( ((1-b)^2)*Log[(1-b)/2] - ((1+b)^2)*Log[(1+b)/2] )/(4*b)
QA = ( ((1-a)^2)*Log[(1-a)/2] - ((1+a)^2)*Log[(1+a)/2] )/(4*a)
QQ = QC - (1-t)*QA - t*QB

(* Lower Bound M(t) *)
mv = (1 - av.cv)*bv - (1 - bv.cv)*av
m = Sqrt[mv.mv]
MA = ( (m + av.mv)*Log[(m + av.mv)/(m + cv.mv)] +
       (m - av.mv)*Log[(m - av.mv)/(m - cv.mv)] )/(2*m)
MB = ( (m + bv.mv)*Log[(m + bv.mv)/(m + cv.mv)] +
       (m - bv.mv)*Log[(m - bv.mv)/(m - cv.mv)] )/(2*m)
MM = (1-t)*MA + t*MB

(* Upper Bound L(t) *)
Ld = Sqrt[ (1 - av.bv)^2 - (1 - a^2)*(1 - b^2) ]
LA = (Ld - av.dv)*Log[Ld - av.dv] + (Ld + av.dv)*Log[Ld + av.dv]
LB = (Ld - bv.dv)*Log[Ld - bv.dv] + (Ld + bv.dv)*Log[Ld + bv.dv] - LA
LC = -(Ld - cv.dv)*Log[Ld - cv.dv] - (Ld + cv.dv)*Log[Ld + cv.dv]
LL = ( Ld/(2*d^2) )*( LC + t*LB + LA )

(* Upper Bound R(t) Based On Jensen's Inequality *)
RA = a^2 + ((1 - av.cv)^2)/(1 - c^2)
RB = b^2 + ((1 - bv.cv)^2)/(1 - c^2)
RR = (1-t)*Log[RA] + t*Log[RB]

(* Upper Bound P(t) Based On Purifications *)
qq = (1 + av.bv + Sqrt[1-a^2]*Sqrt[1-b^2])/2
p = Sqrt[1 - 4*t*(1-t)*qq]
PA = (p + 1 - 2*t*qq)*Log[(1 + p)/(2*(1-t))] +
     (p - 1 + 2*t*qq)*Log[(1 - p)/(2*(1-t))]
PB = (p + 1 - 2*(1-t)*qq)*Log[(1 + p)/(2*t)] +
     (p - 1 + 2*(1-t)*qq)*Log[(1 - p)/(2*t)]
PP = ( (1-t)*PA + t*PB )/(2*p)
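
For readers without Mathematica, the listing can be transcribed into Python. What follows is a hedged sketch rather than the code used for the plots: the function name `bounds`, the helper `xlogy`, and the sample parameters are assumptions introduced here, and degenerate inputs (identical states, c = 0, or t equal to 0 or 1) are not handled.

```python
import math

def xlogy(x, y):
    """x * ln(y), with the convention that the term vanishes when x <= 0."""
    return 0.0 if x <= 0.0 else x * math.log(y)

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def bounds(aa, bb, theta, t):
    """Bounds S, Q, M, L, R, P (nats) for qubit states with Bloch lengths aa, bb at angle theta."""
    av = (aa, 0.0, 0.0)
    bv = (bb * math.cos(theta), bb * math.sin(theta), 0.0)
    dv = tuple(y - x for x, y in zip(av, bv))
    cv = tuple(x + t * z for x, z in zip(av, dv))
    a, b, c, d = (math.sqrt(dot(v, v)) for v in (av, bv, cv, dv))

    def entropy(r):        # von Neumann entropy of a qubit with Bloch length r
        return -(xlogy(1 - r, (1 - r) / 2) + xlogy(1 + r, (1 + r) / 2)) / 2

    def subentropy(r):     # Jozsa-Robb-Wootters quantity Q for Bloch length r
        return (xlogy((1 - r) ** 2, (1 - r) / 2) - xlogy((1 + r) ** 2, (1 + r) / 2)) / (4 * r)

    S = entropy(c) - (1 - t) * entropy(a) - t * entropy(b)
    Q = subentropy(c) - (1 - t) * subentropy(a) - t * subentropy(b)

    # Lower bound M(t): measurement along mv = (1 - a.c) b - (1 - b.c) a
    mv = tuple((1 - dot(av, cv)) * y - (1 - dot(bv, cv)) * x for x, y in zip(av, bv))
    m = math.sqrt(dot(mv, mv))

    def kl_along_m(v):
        vm, cm = dot(v, mv), dot(cv, mv)
        return (xlogy(m + vm, (m + vm) / (m + cm)) + xlogy(m - vm, (m - vm) / (m - cm))) / (2 * m)

    M = (1 - t) * kl_along_m(av) + t * kl_along_m(bv)

    # Upper bound L(t)
    delta = math.sqrt((1 - dot(av, bv)) ** 2 - (1 - a * a) * (1 - b * b))

    def beta(v):
        vd = dot(v, dv)
        return xlogy(delta - vd, delta - vd) + xlogy(delta + vd, delta + vd)

    L = delta / (2 * d * d) * (-beta(cv) + t * (beta(bv) - beta(av)) + beta(av))

    # Upper bound R(t) from Jensen's inequality
    RA = a * a + (1 - dot(av, cv)) ** 2 / (1 - c * c)
    RB = b * b + (1 - dot(bv, cv)) ** 2 / (1 - c * c)
    R = (1 - t) * math.log(RA) + t * math.log(RB)

    # Upper bound P(t) from purifications
    q = (1 + dot(av, bv) + math.sqrt(1 - a * a) * math.sqrt(1 - b * b)) / 2
    p = math.sqrt(1 - 4 * t * (1 - t) * q)
    s = 1 - t
    PA = xlogy(p + 1 - 2 * t * q, (1 + p) / (2 * s)) + xlogy(p - 1 + 2 * t * q, (1 - p) / (2 * s))
    PB = xlogy(p + 1 - 2 * s * q, (1 + p) / (2 * t)) + xlogy(p - 1 + 2 * s * q, (1 - p) / (2 * t))
    P = (s * PA + t * PB) / (2 * p)

    return {"S": S, "Q": Q, "M": M, "L": L, "R": R, "P": P}

vals = bounds(0.9, 2.0 / 3.0, math.pi / 3, 0.4)
print(vals)
```

Sweeping t over (0, 1) with fixed aa, bb, and theta produces the curves for one panel of the gallery.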



3.6 The Quantum Kullback Information 

The quantum Kullback information for a density operator $\rho_0$ relative to density operator $\rho_1$ is defined to be

$$ K(\rho_0/\rho_1) \equiv \max_{\{E_b\}} \sum_b {\rm tr}(\rho_0 E_b)\,\ln\!\left( \frac{{\rm tr}(\rho_0 E_b)}{{\rm tr}(\rho_1 E_b)} \right) , \tag{3.309} $$

where the maximization is taken over all POVMs $\{E_b\}$. This quantity is significant because it details a notion of distinguishability for quantum states most notably in the following way. (For other interpretations of the Kullback-Leibler relative information, see Chapter 2.)

Suppose $N \gg 1$ copies of a quantum system are prepared identically to be in state $\rho_1$. If a POVM $\{E_b : b = 1, \ldots, n\}$ is measured on each of these, the most likely frequencies for the various outcomes b will be those given by the probability estimates $p_1(b) = {\rm tr}(\rho_1 E_b)$ themselves. All other frequencies besides this "natural" set will become less and less likely for large N as statistical fluctuations in the frequencies eventually damp away. In fact, any set of outcome frequencies $\{f(b)\}$, distinct from the "natural" ones $\{p_1(b)\}$, will become exponentially less likely with the number of measurements according to [10]

$$ e^{-N K(f/p_1) - n\ln(N+1)} \le {\rm PROB}\big( {\rm freq} = \{f(b)\} \,\big|\, {\rm prob} = \{p_1(b)\} \big) \le e^{-N K(f/p_1)} , \tag{3.310} $$

where

$$ K(f/p_1) = \sum_{b=1}^{n} f(b)\,\ln\!\left( \frac{f(b)}{p_1(b)} \right) \tag{3.311} $$

is the Kullback-Leibler relative information between the distributions f(b) and $p_1(b)$. Therefore the quantity $K(f/p_1)$, which controls the leading behavior of this exponential decline, says something about how dissimilar the frequencies $\{f(b)\}$ are from the "natural" ones $\{p_1(b)\}$.
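
The exponential bounds on the probability of a set of frequencies can be illustrated with exact multinomial probabilities. This is a small sketch under the assumption that the N measurement outcomes are independent draws from $p_1(b)$; the function names and the two-outcome example are illustrative.

```python
import math

def relative_information(f, p):
    """Kullback-Leibler relative information K(f/p) in nats."""
    return sum(fb * math.log(fb / pb) for fb, pb in zip(f, p) if fb > 0.0)

def frequency_probability(counts, p):
    """Exact multinomial probability of observing the given outcome counts."""
    total = sum(counts)
    log_prob = math.lgamma(total + 1)
    for k, pb in zip(counts, p):
        log_prob += k * math.log(pb) - math.lgamma(k + 1)
    return math.exp(log_prob)

counts = [5, 15]          # observed counts for n = 2 outcomes, N = 20 trials
p1 = [0.5, 0.5]           # the "natural" probabilities
N, n = sum(counts), len(counts)
f = [k / N for k in counts]

K = relative_information(f, p1)
lower = math.exp(-N * K - n * math.log(N + 1))
upper = math.exp(-N * K)
prob = frequency_probability(counts, p1)
print(lower, prob, upper)
```

The exact probability sits between the two exponentials, and the gap between them is only polynomial in N, so K(f/p1) fixes the leading exponential rate.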

Now suppose instead that the measurements are performed on quantum systems identically prepared in the state $\rho_0$. The outcome frequencies most likely to appear in this scenario are those specified by the distribution $p_0(b) = {\rm tr}(\rho_0 E_b)$. Therefore the particular POVM $\{E_b\}$ achieving the maximum in Eq. (3.309) has the following interpretation. It is the measurement for which the natural frequencies of outcomes for state $\rho_0$ are maximally distinct from those for measurements on $\rho_1$, given that $\rho_1$ is actually controlling the statistics. In this sense, Eq. (3.309) gives an operationally defined (albeit asymmetric) notion of distinguishability for quantum states.

Figure 3.3: All the bounds to accessible information studied here, for the case that $\rho_0$ is pure (a = 1), $\rho_1$ is mixed with b = 2/3, and the angle between the two Bloch vectors is π/4. (Plot of information versus prior probability t.)

Figure 3.4: All the bounds to accessible information studied here, for the case that $\rho_0$ and $\rho_1$ are pure states (a = b = 1) and the angle between the two Bloch vectors is π/4. For this case, M(t) = P(t) = I(t). (Plot of information versus prior probability t.)

Figure 3.5: All the bounds to accessible information studied here, for the case that $\rho_0$ and $\rho_1$ are mixed states with a = … and b = … and the angle between the two Bloch vectors is π/3. (Plot versus prior probability t.)

Figure 3.6: All the bounds to accessible information studied here, for the case that $\rho_0$ and $\rho_1$ are pure states (a = b = 1) and the angle between the two Bloch vectors is π/3. For this case, M(t) = P(t) = I(t).

Figure 3.7: The bounds S(t), L(t), M(t), and P(t) for the case that $\rho_0$ and $\rho_1$ are mixed states with a = … and b = … and the angle between the two Bloch vectors is π/5.

Figure 3.8: The bounds S(t), L(t), M(t), and P(t) for the case that $\rho_0$ and $\rho_1$ are mixed states with a = b = … and the angle between the two Bloch vectors is π/4.

The main difficulty with Eq. (3.309) as a notion of distinguishability is that one would like an explicit, convenient expression for it, not simply the empty definition given there. Chances for that, however, are very slim. For just by looking at the 2 × 2 density operator case, one can see that the POVM optimal for this criterion must satisfy a transcendental equation. For instance, by setting the variation of Eq. (3.250) to zero, we see that an optimal orthogonal projection-valued measurement $\vec n$ must satisfy

$$ \ln\!\left[ \frac{(n+\vec a\cdot\vec n)(n-\vec c\cdot\vec n)}{(n-\vec a\cdot\vec n)(n+\vec c\cdot\vec n)} \right] \vec a_\perp + 2n \left[ \frac{\vec c\cdot\vec n-\vec a\cdot\vec n}{n^2-(\vec c\cdot\vec n)^2} \right] \vec c_\perp = 0 , \tag{3.312} $$

where the vectors $\vec a_\perp$ and $\vec c_\perp$ are defined as in Eq. (3.253). This may well not have an explicit solution in general. We are again forced to study bounds rather than exact solutions, just as was the case for accessible information.

So far, several bounds to the quantum Kullback information have come along incidentally. There are the lower bounds given by Eqs. (3.93) and (3.204), which will soon be revisited in Section 3.6.2. And there are the upper and lower bounds due to Jensen's inequality given in Eq. (3.295). In the remainder of this Section, we shall detail a few more bounds, both upper and lower.



3.6.1 The Umegaki Relative Information 

The Umegaki relative information between $\rho_0$ and $\rho_1$ is defined by

$$ K_U(\rho_0/\rho_1) = {\rm tr}\big( \rho_0\ln\rho_0 - \rho_0\ln\rho_1 \big) . \tag{3.313} $$

This concept was introduced by Umegaki in 1962 [125], and a large literature exists concerning it. Some authors have gone so far as to label it the "proper" notion of a relative information for two quantum states [145, 126].

As we have already seen, the Holevo upper bound to mutual information is easily expressible in terms of Eq. (3.313). This turns out to be no surprise because, indeed, this quantity is an upper bound to the quantum Kullback information itself [146]. This will be demonstrated directly from the Holevo bound in this Section.

For simplicity, let us assume that both $\rho_0$ and $\rho_1$ are invertible. Let $\{E_b\}$ be any POVM and $p_0(b)$ and $p_1(b)$ be the probability distributions it generates. Suppose 0 < t < 1. Then the Holevo bound Eq. (3.210) can be rewritten as

$$ K(p_0/p) + \frac{t}{1-t}\,K(p_1/p) \le K_U(\rho_0/\rho) + \frac{t}{1-t}\,K_U(\rho_1/\rho) . \tag{3.314} $$

Taking the limit of this as t approaches 1, we obtain

$$ K(p_0/p_1) + \lim_{t\to1}\frac{t}{1-t}\,K(p_1/p) \le K_U(\rho_0/\rho_1) + \lim_{t\to1}\frac{t}{1-t}\,K_U(\rho_1/\rho) . \tag{3.315} $$



It turns out that the limits yet to be evaluated vanish; let us show this. Note that, trivially,

$$ \lim_{t\to1}(1-t) = \lim_{t\to1}K(p_1/p) = \lim_{t\to1}K_U(\rho_1/\rho) = 0 . \tag{3.316} $$

Therefore we must use l'Hospital's rule to evaluate the desired limits. In the first case, this readily gives

$$ \lim_{t\to1}\frac{t}{1-t}\,K(p_1/p) = -\lim_{t\to1}\left[ K(p_1/p) + t\,\frac{d}{dt}K(p_1/p) \right] = \lim_{t\to1}\sum_b \frac{p_1(b)}{p(b)}\,{\rm tr}(\Delta E_b) = {\rm tr}(\Delta) = 0 . \tag{3.317} $$

The second case is slightly more difficult. l'Hospital's rule gives

$$ \lim_{t\to1}\frac{t}{1-t}\,K_U(\rho_1/\rho) = \lim_{t\to1}\frac{d}{dt}\,{\rm tr}(\rho_1\ln\rho) , \tag{3.318} $$

but there is still some work required to evaluate the last expression.

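Recall that the operator $\ln\rho$ may be represented by a contour integral,

$$ \ln\rho = \frac{1}{2\pi i}\oint_C \ln z\,\big(z\mathbb{1}-\rho\big)^{-1}\,dz , \tag{3.319} $$

where the contour C encloses all the eigenvalues of $\rho$ (for all possible values of t). Then

$$ \frac{d}{dt}\ln\rho = \frac{1}{2\pi i}\oint_C \ln z\,\big(z\mathbb{1}-\rho\big)^{-1}\Delta\,\big(z\mathbb{1}-\rho\big)^{-1}\,dz , \tag{3.320} $$

so that

$$ \lim_{t\to1}\frac{d}{dt}\,{\rm tr}(\rho_1\ln\rho) = \frac{1}{2\pi i}\oint_C \ln z\;{\rm tr}\Big( \big(z\mathbb{1}-\rho_1\big)^{-1}\rho_1\big(z\mathbb{1}-\rho_1\big)^{-1}\Delta \Big)\,dz = \sum_k \lambda_k\,\Delta_{kk}\,\frac{1}{2\pi i}\oint_C \frac{\ln z}{(z-\lambda_k)^2}\,dz , \tag{3.321} $$

where $\lambda_k$ are the eigenvalues of $\rho_1$ and $\Delta_{kk}$ are the matrix elements of $\Delta$ in a basis that diagonalizes $\rho_1$. By the Cauchy integral theorem, however,

$$ \frac{1}{2\pi i}\oint_C \frac{\ln z}{(z-\lambda_k)^2}\,dz = \frac{1}{\lambda_k} . \tag{3.322} $$

Therefore

$$ \lim_{t\to1}\frac{d}{dt}\,{\rm tr}(\rho_1\ln\rho) = \sum_k \Delta_{kk} = {\rm tr}(\Delta) = 0 . \tag{3.323} $$

This proves that the Kullback-Leibler relative information for any measurement is bounded above by the Umegaki relative information. In particular, this places an upper bound on the quantum Kullback information,

$$ K(\rho_0/\rho_1) \le K_U(\rho_0/\rho_1) . \tag{3.324} $$

Moreover, since the Holevo bound on mutual information is achievable if and only if $\rho_0$ and $\rho_1$ commute, it follows that there is equality in this bound if and only if the density operators commute.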
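
For qubits the Umegaki bound can be tested directly, since $\ln\rho_1$ is diagonal along the Bloch direction of $\rho_1$. The sketch below is an illustration only: it assumes Bloch vectors of length strictly between 0 and 1 lying in a common plane, sweeps orthogonal projection-valued measurements in that plane, and uses hypothetical function names.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def umegaki_qubit(a_vec, b_vec):
    """K_U(rho0/rho1) = tr(rho0 ln rho0 - rho0 ln rho1) for Bloch vectors a (rho0) and b (rho1)."""
    a = math.sqrt(dot(a_vec, a_vec))
    b = math.sqrt(dot(b_vec, b_vec))
    ab_hat = dot(a_vec, b_vec) / b                  # component of a along b/|b|
    tr_rho0_ln_rho0 = sum(0.5 * (1 + s * a) * math.log(0.5 * (1 + s * a)) for s in (1, -1))
    tr_rho0_ln_rho1 = sum(0.5 * (1 + s * ab_hat) * math.log(0.5 * (1 + s * b)) for s in (1, -1))
    return tr_rho0_ln_rho0 - tr_rho0_ln_rho1

def measured_kl(a_vec, b_vec, phi):
    """Kullback-Leibler information for the projective measurement along (cos phi, sin phi, 0)."""
    n_hat = (math.cos(phi), math.sin(phi), 0.0)
    an, bn = dot(a_vec, n_hat), dot(b_vec, n_hat)
    p0 = (0.5 * (1 + an), 0.5 * (1 - an))
    p1 = (0.5 * (1 + bn), 0.5 * (1 - bn))
    return sum(x * math.log(x / y) for x, y in zip(p0, p1))

a_vec = (0.6, 0.0, 0.0)
b_vec = (0.3 * math.cos(1.0), 0.3 * math.sin(1.0), 0.0)

k_umegaki = umegaki_qubit(a_vec, b_vec)
best_measured = max(measured_kl(a_vec, b_vec, 2 * math.pi * k / 360) for k in range(360))

# Commuting example: parallel Bloch vectors, measured along their common axis.
b_par = (0.3, 0.0, 0.0)
k_umegaki_par = umegaki_qubit(a_vec, b_par)
k_measured_par = measured_kl(a_vec, b_par, 0.0)
print(k_umegaki, best_measured, k_umegaki_par, k_measured_par)
```

In the noncommuting example every measurement in the sweep falls below the Umegaki value, whereas in the commuting example the measurement along the shared axis attains it.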

3.6.2 One Million One Lower Bounds

It appears to be productive to ask, among other things, if there is any systematic procedure for generating successively tighter lower bounds to $K(\rho_0/\rho_1)$. In particular, one would like to know if there is a procedure for finding lower bounds of the form

$$ {\rm tr}\big( \rho_0\ln\Lambda(\rho_0/\rho_1) \big) , \tag{3.325} $$

where $\Lambda(\rho_0/\rho_1)$ is a Hermitian operator that depends (asymmetrically) on $\rho_0$ and $\rho_1$.
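We already know of two such bounds. The first [147, 148] is given by Eq. (3.204),

$$ K_F(\rho_0/\rho_1) = {\rm tr}\big( \rho_0\ln\mathcal{L}_{\rho_1}(\rho_0) \big) , \tag{3.326} $$

where $\mathcal{L}_{\rho_1}(\rho_0)$ [122] is an operator X satisfying the equation

$$ X\rho_1 + \rho_1 X = 2\rho_0 . \tag{3.327} $$

The second [148, 88] is given by Eq. (3.93),

$$ K_B(\rho_0/\rho_1) = 2\,{\rm tr}\Big( \rho_0\ln\Big( \rho_1^{-1/2}\sqrt{\rho_1^{1/2}\rho_0\,\rho_1^{1/2}}\;\rho_1^{-1/2} \Big) \Big) . \tag{3.328} $$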

Both these bounds come quite close to the exact answer defined by Eq. (3.309) when $\rho_0$ and $\rho_1$ are 2 × 2 density operators [150]. Nevertheless, these approximations may not fare so well for density operators of higher dimensionality.

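The trick in finding expressions (3.326) and (3.328) was in noting that the eigenvalues $\lambda_b$ of the operators

$$ \Lambda_1 = \mathcal{L}_{\rho_1}(\rho_0) \tag{3.329} $$

and

$$ \Lambda_2 = \rho_1^{-1/2}\sqrt{\rho_1^{1/2}\rho_0\,\rho_1^{1/2}}\;\rho_1^{-1/2} \tag{3.330} $$

can be written in the form

$$ \lambda_b^{\,p} = \frac{{\rm tr}(\rho_0 E_b)}{{\rm tr}(\rho_1 E_b)} , \tag{3.331} $$

where the $E_b = |b\rangle\langle b|$ are projectors onto the one-dimensional subspaces spanned by eigenvectors $|b\rangle$ of $\Lambda_p$, p = 1 or 2 respectively. Using this fact, one simply notes that

$$ {\rm tr}\big( \rho_0\ln\Lambda_p \big) = {\rm tr}\Big( \rho_0\sum_b (\ln\lambda_b)\,E_b \Big) = \frac{1}{p}\sum_b {\rm tr}(\rho_0 E_b)\,\ln\!\left( \frac{{\rm tr}(\rho_0 E_b)}{{\rm tr}(\rho_1 E_b)} \right) . \tag{3.332} $$

Eq. (3.331) is derived easily from the defining equation for $\Lambda_1$; one just notes that for an eigenvector $|b\rangle$ of $\Lambda_1$,

$$ \langle b|\Lambda_1\rho_1|b\rangle + \langle b|\rho_1\Lambda_1|b\rangle = 2\langle b|\rho_0|b\rangle \quad\Longrightarrow\quad 2\lambda_b\langle b|\rho_1|b\rangle = 2\langle b|\rho_0|b\rangle . \tag{3.333} $$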
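
The eigenvalue property of the first operator can be verified for qubits without any matrix algebra. The sketch below assumes the Bloch form $S\,\mathbb{1} + \vec R\cdot\vec\sigma$ of the solution X to $X\rho_1 + \rho_1 X = 2\rho_0$, with $S = (1-\vec a\cdot\vec b)/(1-b^2)$ and $\vec R = \vec a - S\vec b$ (the two-dimensional representation used earlier for the Jensen bound, with $\rho_1$ in place of $\rho$). Function names and sample vectors are illustrative, and the case $\vec R = 0$ is not handled.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def lowering_bloch(a_vec, b_vec):
    """Bloch form (S, R) of the operator X solving X rho1 + rho1 X = 2 rho0."""
    S = (1.0 - dot(a_vec, b_vec)) / (1.0 - dot(b_vec, b_vec))
    R = [x - S * y for x, y in zip(a_vec, b_vec)]
    return S, R

def first_lower_bound(a_vec, b_vec):
    """Return (bound, eigenvalues, probability ratios) for tr(rho0 ln X) in the eigenbasis of X."""
    S, R = lowering_bloch(a_vec, b_vec)
    R_len = math.sqrt(dot(R, R))
    R_hat = [x / R_len for x in R]
    eigenvalues, ratios, bound = [], [], 0.0
    for sign in (1, -1):
        lam = S + sign * R_len                     # eigenvalue of X
        p0 = 0.5 * (1 + sign * dot(a_vec, R_hat))  # tr(rho0 E_b)
        p1 = 0.5 * (1 + sign * dot(b_vec, R_hat))  # tr(rho1 E_b)
        eigenvalues.append(lam)
        ratios.append(p0 / p1)
        bound += p0 * math.log(lam)
    return bound, eigenvalues, ratios

a_vec = (0.6, 0.0, 0.0)
b_vec = (0.3 * math.cos(1.0), 0.3 * math.sin(1.0), 0.0)
k_first, eigs, ratios = first_lower_bound(a_vec, b_vec)

# Kullback-Leibler information of the same measurement, computed from the ratios directly.
S, R = lowering_bloch(a_vec, b_vec)
R_len = math.sqrt(dot(R, R))
p0_list = [0.5 * (1 + s * dot(a_vec, R) / R_len) for s in (1, -1)]
k_measured = sum(p * math.log(r) for p, r in zip(p0_list, ratios))
print(eigs, ratios, k_first, k_measured)
```

Each eigenvalue coincides with the corresponding ratio of outcome probabilities, so the operator expression equals the Kullback-Leibler information of an actual measurement and therefore cannot exceed the quantum Kullback information.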

The corresponding fact for $\Lambda_2$ was derived from more complex considerations in an earlier section. A much simpler way to see it is by noting that $\Lambda_2$ satisfies the matrix quadratic equation

$$ X\rho_1 X = \rho_0 . \tag{3.334} $$

(This can be seen by inspection.) Then along similar lines as above, if one takes $|b\rangle$ to be an eigenvector of $\Lambda_2$, one gets

$$ \langle b|\Lambda_2\rho_1\Lambda_2|b\rangle = \langle b|\rho_0|b\rangle \quad\Longrightarrow\quad \lambda_b^2\,\langle b|\rho_1|b\rangle = \langle b|\rho_0|b\rangle . \tag{3.335} $$

Now that the common foundation for both lower bounds (3.326) and (3.328) is plain, we are led to ask whether there are any other operators $\Lambda(\rho_0/\rho_1)$ whose eigenvalues have a similar form, i.e.,

$$ \lambda_b\big(\Lambda(\rho_0/\rho_1)\big) = \frac{{\rm tr}(\rho_0 E_b)}{{\rm tr}(\rho_1 E_b)} , \tag{3.336} $$

where $\lambda_b(X)$ denotes the b'th eigenvalue of the operator X. If so, then

$$ \ln\Lambda(\rho_0/\rho_1) = \sum_b \ln\!\left( \frac{{\rm tr}(\rho_0 E_b)}{{\rm tr}(\rho_1 E_b)} \right) E_b , \tag{3.337} $$

and we get the desired result, Eq. (3.325), via the steps in Eq. (3.332).

Posed in this way, the solution to the question becomes quickly apparent. It is found by generalizing the defining equations for the operators $\Lambda_1$ and $\Lambda_2$. For instance, consider the Hermitian operator X defined by

$$ \frac{1}{2}\big( \rho_1 X + X\rho_1 \big) + X\rho_1 X = \rho_0 . \tag{3.338} $$

If its eigenvectors and eigenvalues are $|b\rangle$ and $\lambda_b$, one obtains (by the same method as before)

$$ \frac{\langle b|\rho_0|b\rangle}{\langle b|\rho_1|b\rangle} = \lambda_b^2 + \lambda_b . \tag{3.339} $$

Therefore, the operator

$$ \Lambda = X^2 + X \tag{3.340} $$

has eigenvalues $({\rm tr}\,\rho_0 E_b)/({\rm tr}\,\rho_1 E_b)$, and we obtain another lower bound

$$ {\rm tr}\big( \rho_0\ln(X^2+X) \big) \tag{3.341} $$

to the quantum Kullback information.

More interesting, however, is that we now have a method for inserting parameters into the bound which can be varied to obtain an "optimal" bound. For instance, we could instead consider the solutions $X_\alpha$ to the (parameterized) operator equation

$$ \frac{1}{2}\,\alpha\,\big( \rho_1 X_\alpha + X_\alpha\rho_1 \big) + (1-\alpha)\,X_\alpha\rho_1 X_\alpha = \rho_0 , \tag{3.342} $$

and thus get the best bound of this form by

$$ \max_\alpha\; {\rm tr}\Big( \rho_0\ln\big( (1-\alpha)X_\alpha^2 + \alpha X_\alpha \big) \Big) . \tag{3.343} $$

This bound has no choice but to be at least as good as the bounds $K_F(\rho_0/\rho_1)$ and $K_B(\rho_0/\rho_1)$, simply because Eq. (3.342) interpolates between the measurements defining them in the first place. Moreover, Eq. (3.342) is still within the realm of equations known to the mathematical community; methods for its solution exist [151, 152, 153, 154].

This pretty much builds the picture. More generally, one has measurements defined by

$$ \frac{1}{2}\sum_{i,j}\alpha_{ij}\big( X^i\rho_1 X^j + X^j\rho_1 X^i \big) = \rho_0 , \tag{3.344} $$

giving rise to lower bounds to the quantum Kullback information of the form

$$ {\rm tr}\Big( \rho_0\ln\Big( \sum_{i,j}\alpha_{ij}X^{i+j} \Big) \Big) . \tag{3.345} $$

(Here i and j may range anywhere from 0 up to values for which the $\alpha_{ij}$ are no longer freely specifiable.) These may then be varied over all the parameters $\alpha_{ij}$ to find the best bound allowable at that order. To the extent that solutions to Eq. (3.344) can be found, even numerically, better lower bounds to the quantum Kullback information can be generated.



3.6.3 Upper Bound Based on Ando's Inequality and Other Bounds from the Literature

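Another way to get an upper bound on the quantum Kullback information is by examining Eq. (3.137), the bound on the quantum Renyi overlap due to Ando's inequality. With this, one immediately has

$$ \frac{1}{\alpha-1}\ln\!\Big( \sum_b p_0(b)^{\alpha}\,p_1(b)^{1-\alpha} \Big) \le \frac{1}{\alpha-1}\ln{\rm tr}\Big( \rho_1^{1/2}\big( \rho_1^{-1/2}\rho_0\,\rho_1^{-1/2} \big)^{\alpha}\rho_1^{1/2} \Big) . \tag{3.346} $$

However, as $\alpha \to 1$, the left hand side of this inequality converges to the Kullback-Leibler information. Therefore if we can evaluate the right hand side of this in the limit, we will have generated a new bound. Using l'Hospital's rule, we get

$$ \lim_{\alpha\to1}{\rm RHS} = \lim_{\alpha\to1}\Big[ {\rm tr}\Big( \rho_1^{1/2}\big( \rho_1^{-1/2}\rho_0\,\rho_1^{-1/2} \big)^{\alpha}\rho_1^{1/2} \Big) \Big]^{-1} {\rm tr}\Big( \rho_1^{1/2}\big( \rho_1^{-1/2}\rho_0\,\rho_1^{-1/2} \big)^{\alpha}\ln\big( \rho_1^{-1/2}\rho_0\,\rho_1^{-1/2} \big)\rho_1^{1/2} \Big) . \tag{3.347} $$

Therefore, one arrives at the relatively asymmetric upper bound to the quantum Kullback information given by

$$ K(\rho_0/\rho_1) \le {\rm tr}\Big( \big( \rho_1^{1/2}\rho_0\,\rho_1^{-1/2} \big)\ln\big( \rho_1^{-1/2}\rho_0\,\rho_1^{-1/2} \big) \Big) . \tag{3.348} $$

Note that when $\rho_0$ and $\rho_1$ commute, this reduces to the Umegaki relative information.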

Finally, we note that the last property of reducing to the Umegaki relative entropy is not an uncommon property of many known upper bounds to it. For instance it is known [155] that for every p > 0,

$$ K_U(\rho_0/\rho_1) \le \frac{1}{p}\,{\rm tr}\Big( \rho_0\ln\big( \rho_0^{p/2}\rho_1^{-p}\rho_0^{p/2} \big) \Big) , \tag{3.349} $$

and all of these have this property. Alternatively, so do the lower bounds Eqs. (3.204) and (3.93) to the quantum Kullback information derived earlier, as well as the lower bound [155] given by

$$ K(\rho_0/\rho_1) \ge {\rm tr}\Big( \rho_0\ln\big( \rho_1^{-1/2}\rho_0\,\rho_1^{-1/2} \big) \Big) . \tag{3.350} $$

This may or may not lessen the importance of the Umegaki upper bound, depending upon one's taste.

Chapter 4

Distinguishability in Action



"Something only really happens when 
an observation is being made .... Be- 
tween the observations nothing at all hap- 
pens, only time has, 'in the interval,' irre- 
versibly progressed on the mathematical 
papers!" 

— Wolfgang Pauli 
Letter to Markus Fierz 
30 March 1947 

4.1 Introduction 

In the preceding chapters, we went to great lengths to define and calculate various notions of 
distinguishability for quantum mechanical states. These notions are of intrinsic interest for their 
associated statistical problems. However, we might still wish for a larger payoff on the work 
invested. This is what this Chapter is about. Here, we present the briefest of sketches of what 
can be accomplished by using some of the measures of distinguishability already encountered. The 
applications concern two questions that lie much closer to the foundations of quantum theory than 
was our previous concern. 

"Quantum mechanical measurements disturb the states of quantum systems in uncontrollable 
ways." Statements like this are uttered in almost every beginning quantum mechanics course — it 
is part of the folklore of the theory. But what does this really mean? How is it to be quantified? 
The next two sections outline steps toward answers to these questions. 

4.2 Inference vs. Disturbance of Quantum States: Extended Abstract

Suppose an observer obtains a quantum system secretly prepared in one of two standard but nonorthogonal quantum states. Quantum theory dictates that there is no measurement he can use to certify which of the two states was actually prepared. This is well known and has already been discussed many times in the preceding chapters. A simple, but less recognized, corollary to this is that no interaction used for performing such an information-gathering measurement can leave both states unchanged in the process. If the observer could completely regenerate the unknown quantum state after measurement, then, by making further nondisturbing information-gathering measurements on it, he would be able eventually to infer the state's identity after all.

This consistency argument is enough to establish a tension between inference and disturbance in quantum theory. What it does not capture, however, is the extent of the tradeoff between these two quantities. In this Section, we shall lay the groundwork for a quantitative study that goes beyond the qualitative nature of this tension.¹ Namely, we will show how to capture in a formal way the idea that, depending upon the particular measurement interaction, there can be a tradeoff between the disturbance of the quantum states and the acquired ability to make inferences about their identity. The formalism so developed should have applications to quantum cryptography on noisy channels and to error correction and stabilization in quantum computing.



4.2.1 The Model 

The model we shall base our considerations on is most easily described in terms borrowed from 
quantum cryptography, though this problem should not be identified with the cryptographic one. 
Alice randomly prepares a quantum system to be in either a state pQ or a state pi. These states will 
be described by x density operators on an A^-dimensional Hilbert space, A^ arbitrary; there 
is no restriction that they be pure states or orthogonal for that matter. After the preparation, 
the quantum system is passed into a "black box" where it may be probed by an eavesdropper 
Eve in any way allowed by the laws of quantum mechanics. That is to say. Eve may first allow 
the system to interact with an auxiliary system, or ancilla, and then perform quantum mechanical 
measurements on the ancilla itself |156(| . The outcome of such a measurement may provide Eve 
with some information about the quantum state and may even provide her a basis on which to 
make an inference as to the state's identity. Upon this manhandling by Eve, the quantum system is 



passed out of the "black box" and into the possession of a third person Bob. (See related Figure 4.1 
depicting Eve and Bob only.) 

A crucial aspect of this model is that even if Bob knows the state actually prepared by Alice 
and, furthermore, the manner in which Eve operates and the exact measurement she performs, 
without knowledge of the answer she actually obtains, he will have to resort to a new description 
of the quantum system after it emerges from the "black box", say some $\hat\rho_0'$ or $\hat\rho_1'$. This is where
the detail of our work takes its start. Eve has gathered information and the state of the quantum 
system has changed in the process. 

The ingredients required formally to pose the question of the Introduction follow from the 
details of the model. We shall need: 

A. a convenient description of the most general kind of quantum measurement, 

B. a similarly convenient description of all the possible physical interactions that could give rise 
to that measurement, 

C. a measure of the information or inference power provided by any given measurement, 

D. a good notion by which to measure the distinguishability of mixed quantum states and a 
measure of disturbance based on it, and finally 

E. a "figure of merit" by which to compare the disturbance with the inference. 



* This Section is based on a manuscript disseminated during the "Quantum Computation 1995" workshop held
at the Institute for Scientific Interchange (Turin, Italy); as such, it contains a small redundancy with the previous
Chapters.

Figure 4.1: Set-up for Inference-Disturbance Tradeoff

The way these ingredients are stirred together to make a good soup, very schematically, is the
following. We first imagine some fixed measurement $\mathcal{M}$ on the part of Eve and some particular
physical interaction $\mathcal{I}$ used to carry out that measurement. (Note that $\mathcal{I}$ uniquely determines $\mathcal{M}$,
whereas in general $\mathcal{M}$ says very little about $\mathcal{I}$.) Using the measures from ingredients C and D, we
can then quantify the inference power

$$\mathrm{Inf}(\hat\rho_0,\hat\rho_1;\mathcal{M}) \qquad (4.1)$$

accorded Eve and the necessary disturbance

$$\mathrm{Dist}(\hat\rho_0,\hat\rho_1;\mathcal{I}) \qquad (4.2)$$

apparent to Bob. As the notation indicates, besides depending on $\hat\rho_0$ and $\hat\rho_1$, the inference power
otherwise depends only on $\mathcal{M}$ and the disturbance only on $\mathcal{I}$. Using the agreed upon figure of merit
FOM, we arrive at a number that describes the tradeoff between inference and disturbance for the
fixed measurement and interaction:

$$\mathrm{FOM}\Big[\mathrm{Inf}(\hat\rho_0,\hat\rho_1;\mathcal{M}),\ \mathrm{Dist}(\hat\rho_0,\hat\rho_1;\mathcal{I})\Big] . \qquad (4.3)$$

Now the idea is to remove the constraint that the measurement $\mathcal{M}$ and the interaction $\mathcal{I}$ be fixed,
to reexpress $\mathcal{M}$ in terms of $\mathcal{I}$, and to optimize

$$\mathrm{FOM}\Big[\mathrm{Inf}\big(\hat\rho_0,\hat\rho_1;\mathcal{M}(\mathcal{I})\big),\ \mathrm{Dist}(\hat\rho_0,\hat\rho_1;\mathcal{I})\Big] \qquad (4.4)$$

over all measurement interactions $\mathcal{I}$. This is the whole story. With the optimal such expression
in hand, one automatically arrives at a tradeoff relation between inference and disturbance: for
arbitrary $\mathcal{M}$ and $\mathcal{I}$, expression (4.3) will be less or greater than the optimal such one, depending
on its exact definition.



4.2.2 The Formalism 

The first difficulty in our task is in finding a suitably convenient and useful formalism for describing 
Ingredients A and B. The most general measurement procedure allowed by the laws of quantum 
mechanics is, as already stated, first to have the system of interest interact with an ancilla and 
then to perform a standard (von Neumann) quantum measurement on the ancilla itself. Taken at 
face value, this description can be transcribed into mathematical terms as follows. The system of 
interest, starting out in some quantum state $\hat\rho_s$, is placed in conjunction with an ancilla prepared
in a standard state $\hat\rho_a$. The conjunction of these two systems is described by the initial quantum
state

$$\hat\rho_{sa} = \hat\rho_s \otimes \hat\rho_a . \qquad (4.5)$$

The interaction of the two systems leads to a unitary time evolution,

$$\hat\rho_{sa} \longrightarrow \hat U^\dagger \hat\rho_{sa} \hat U . \qquad (4.6)$$

(Note that the state of the system-plus-ancilla, by way of this interaction, will generally not remain
in its original tensor-product form; rather the states of the two systems will become inextricably
entangled and correlated.) Finally, a reproducible measurement on the ancilla is described via a
set of orthogonal projection operators $\{\hat 1 \otimes \hat\Pi_b\}$ acting on the ancilla's Hilbert space: any particular
outcome $b$ is found with probability

$$p(b) = \mathrm{tr}\big((\hat 1 \otimes \hat\Pi_b)\,\hat U^\dagger(\hat\rho_s \otimes \hat\rho_a)\hat U\big) , \qquad (4.7)$$

and the description of the system-plus-ancilla after the finding of this outcome must be updated
according to the standard rules to

$$\hat\rho'_{sa|b} = \frac{1}{p(b)}\,(\hat 1 \otimes \hat\Pi_b)\,\hat U^\dagger(\hat\rho_s \otimes \hat\rho_a)\hat U\,(\hat 1 \otimes \hat\Pi_b) . \qquad (4.8)$$

Thus the quantum state describing the system alone after finding outcome $b$ is

$$\hat\rho'_{s|b} = \mathrm{tr}_a(\hat\rho'_{sa|b}) , \qquad (4.9)$$

where $\mathrm{tr}_a$ denotes a partial trace over the ancilla's Hilbert space. If one knows this measurement
was performed but does not know the actual outcome, then the description given the quantum
system will be rather

$$\hat\rho'_s = \sum_b p(b)\,\hat\rho'_{s|b} = \mathrm{tr}_a\big(\hat U^\dagger(\hat\rho_s \otimes \hat\rho_a)\hat U\big) . \qquad (4.10)$$
This face-value description of a general measurement gives, in principle, everything required
of it. Namely it gives a formal description of the probabilities of the measurement outcomes and
it gives a formal description of the system's state evolution arising from this measurement. The
problem with this for the purpose at hand is that it focuses attention away from the quantum system
itself, placing undue emphasis on the fact that there is an ancilla in the background. Unfortunately,
it is this sort of thing that can make the formulation of optimization problems over measurements
and interactions more difficult than it need be. On the brighter side, there is a way of getting
around this particular deficiency of notation. This is accomplished by introducing the formalism of
"effects and operations" [157, 158, 159, 160, 161, 2q], which we shall attempt to sketch presently.
(An alternative formalism that may also be of use in this context is that of Mayers [156].)

Recently, more and more attention has been given to the (long known) fact that the probability
formula, Eq. (4.7), can be written in a way that relies on the system Hilbert space alone with no
overt reference to the ancilla (2^]. Namely, one can write

$$p(b) = \mathrm{tr}_s(\hat\rho_s \hat E_b) , \qquad (4.11)$$

where $\mathrm{tr}_s$ denotes a partial trace over the system's Hilbert space, by simply taking the operator $\hat E_b$
to be

$$\hat E_b = \mathrm{tr}_a\big((\hat 1 \otimes \hat\rho_a)\,\hat U(\hat 1 \otimes \hat\Pi_b)\hat U^\dagger\big) . \qquad (4.12)$$

The reason this can be viewed as making no direct reference to the ancilla is that it has been noted
that any set of positive semi-definite operators $\{\hat E_b\}$ satisfying

$$\sum_b \hat E_b = \hat 1 \qquad (4.13)$$

can be written in a form specified by Eq. (4.12). Therefore these sets of operators, known as
positive operator-valued measures or POVMs, stand in one-to-one correspondence with the set
of generalized quantum mechanical measurements. This correspondence gives us the freedom to
exclude or to include explicitly the ancilla in our description of a measurement's statistics, whichever
is the more convenient.
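The equivalence between the ancilla picture of Eq. (4.7) and the POVM picture of Eqs. (4.11) and (4.12) can be checked numerically. The following is only a minimal sketch: the dimensions, the random states, the random unitary, and all function names are illustrative assumptions, and the evolution convention $\hat\rho \to \hat U^\dagger\hat\rho\hat U$ is taken from Eq. (4.6) as read here.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
ds, da = 2, 3                    # illustrative system and ancilla dimensions

def random_density(d):
    """Random density operator (positive, unit trace); purely illustrative."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def random_unitary(d):
    """Random unitary taken from the QR factorization of a random matrix."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, _ = np.linalg.qr(g)
    return q

def trace_out_ancilla(x):
    """Partial trace over the ancilla for an operator on system (x) ancilla."""
    return np.einsum('iaja->ij', x.reshape(ds, da, ds, da))

rho_s, rho_a = random_density(ds), random_density(da)
U = random_unitary(ds * da)
Ud = U.conj().T
one_s = np.eye(ds)

# Projectors 1 (x) Pi_b onto the ancilla's computational basis.
projectors = [np.kron(one_s, np.outer(e, e)) for e in np.eye(da)]

# Eq. (4.7): probabilities computed with the ancilla kept explicit.
p_with_ancilla = [np.trace(P @ Ud @ np.kron(rho_s, rho_a) @ U).real
                  for P in projectors]

# Eq. (4.12): POVM elements acting on the system alone.
povm = [trace_out_ancilla(np.kron(one_s, rho_a) @ U @ P @ Ud)
        for P in projectors]

# Eq. (4.11): the same probabilities from the POVM.
p_from_povm = [np.trace(rho_s @ E).real for E in povm]
```

Under these assumptions the two probability lists agree and the POVM elements sum to the identity, which is the completeness property of Eq. (4.13).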

A quick example illustrating why this particular representation of a measurement's outcomes
statistics can be useful is the following. Consider the problem of trying to distinguish the quantum
states $\hat\rho_0$ and $\hat\rho_1$ from each other by performing some measurement whose outcome probabilities
are $p_0(b)$ if the state is actually $\hat\rho_0$ and $p_1(b)$ if the state is actually $\hat\rho_1$. A nice measure of the
distinguishability of the probability distributions generated by this measurement is their "statistical
overlap" 1]

$$F(p_0,p_1) = \sum_b \sqrt{p_0(b)}\,\sqrt{p_1(b)} . \qquad (4.14)$$

This quantity is equal to unity whenever the probability distributions are identical and equal to
zero when there is no overlap at all between them. To get at a notion of how distinct $\hat\rho_0$ and $\hat\rho_1$
can be made with respect to this measure, one would like to minimize expression (4.14) over all
possible quantum measurements, that is to say, over all possible measurement interactions with all
possible ancilla. Without the formalism of POVMs this would be quite a difficult task to pull off.
With the formalism of POVMs, however, we can pose the problem as that of finding the quantity

$$F(\hat\rho_0,\hat\rho_1) = \min_{\{\hat E_b\}} \sum_b \sqrt{\mathrm{tr}\,\hat\rho_0\hat E_b}\,\sqrt{\mathrm{tr}\,\hat\rho_1\hat E_b} \qquad (4.15)$$

where the minimization is over all sets of operators $\{\hat E_b\}$ satisfying $\hat E_b \ge 0$ and Eq. (4.13). This
rendition of the problem makes it tractable. In fact, it can be shown [149] that

$$F(\hat\rho_0,\hat\rho_1) = \mathrm{tr}\sqrt{\hat\rho_1^{1/2}\,\hat\rho_0\,\hat\rho_1^{1/2}} , \qquad (4.16)$$

a quantity known in other contexts as (the square root of) Uhlmann's "transition probability" or
"fidelity" for general quantum states [^, ^. The key to the proof of this is in relying as heavily
as one can on the defining characteristic Eq. (4.13) of all POVMs. Namely, one uses the standard
Schwarz inequality to lower bound Eq. (4.15) by an expression linear in the $\hat E_b$, and then uses the
completeness property Eq. (4.13) to sum those operators out of this bound. Finally, checking that
there is a way of satisfying equality in each step of the process, Eq. (4.16) follows.
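As a numerical illustration of Eqs. (4.14) through (4.16), the sketch below evaluates the fidelity from its closed form and compares it with the statistical overlap produced by a family of projective measurements. The two density matrices and the scanned family of real rotated bases are arbitrary assumptions chosen for illustration; the scan is not a full minimization over POVMs, so it only shows that the overlap never drops below the fidelity.

```python
import numpy as np

def psd_sqrt(a):
    """Positive square root of a positive semi-definite Hermitian matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T

def fidelity(rho0, rho1):
    """Closed form of Eq. (4.16): tr sqrt(rho1^(1/2) rho0 rho1^(1/2))."""
    s = psd_sqrt(rho1)
    return np.trace(psd_sqrt(s @ rho0 @ s)).real

def statistical_overlap(rho0, rho1, povm):
    """Eq. (4.14) for the outcome distributions generated by a POVM."""
    total = 0.0
    for E in povm:
        p0 = max(np.trace(rho0 @ E).real, 0.0)
        p1 = max(np.trace(rho1 @ E).real, 0.0)
        total += np.sqrt(p0 * p1)
    return total

def rotated_basis_povm(t):
    """Projectors onto an orthonormal basis rotated by angle t (real qubit)."""
    v0 = np.array([np.cos(t), np.sin(t)])
    v1 = np.array([-np.sin(t), np.cos(t)])
    return [np.outer(v0, v0), np.outer(v1, v1)]

# Two illustrative mixed qubit states (assumed values, unit trace, positive).
rho0 = np.array([[0.8, 0.2], [0.2, 0.2]])
rho1 = np.array([[0.3, -0.1], [-0.1, 0.7]])

F = fidelity(rho0, rho1)
overlaps = [statistical_overlap(rho0, rho1, rotated_basis_povm(t))
            for t in np.linspace(0.0, np.pi, 721)]
best_overlap = min(overlaps)
```

Every entry of `overlaps` should be at least `F`, in line with the minimization in Eq. (4.15).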

To restate the point of this example: the tractability of our optimization problems may be
greatly enhanced by the use of mathematical tools that focus on the essential formal characteristics
of quantum measurements and not on their imagined implementation. (Other examples where
great headway has been made by relying on the abstract defining properties of POVMs are the
maximizing of Fisher information for quantum parameter estimation [122] and the bounding of
quantum mutual information M, 147, 116.)

All that said, the use of POVMs only moves us partially toward our goal. This formalism in
and of itself has nothing to say about the post-measurement states given by Eqs. (4.8) and (4.10);
we still do not have a particularly convenient (or ancilla-independent) way of representing these
state evolutions. It takes the formalism of "effects and operations" to complete the story. This fact
is encapsulated by the following "representation" theorem of Kraus [^] .

Theorem 4.1 Let $\{\hat E_b\}$ be the POVM derived from the measurement procedure described via
Eqs. (4.5) through (4.10). Then there exists a set of (generally non-Hermitian) operators $\{\hat A_{bi}\}$
acting on the system's Hilbert space such that

$$\hat E_b = \sum_i \hat A_{bi}^\dagger \hat A_{bi} , \qquad (4.17)$$

and the conditional and unconditional state evolutions under this measurement can be written as

$$\hat\rho'_{s|b} = \frac{1}{\mathrm{tr}_s(\hat\rho_s \hat E_b)} \sum_i \hat A_{bi}\,\hat\rho_s\,\hat A_{bi}^\dagger \qquad (4.18)$$

and

$$\hat\rho'_s = \sum_{b,i} \hat A_{bi}\,\hat\rho_s\,\hat A_{bi}^\dagger . \qquad (4.19)$$

Moreover, for any set of operators $\{\hat A_{bi}\}$ such that

$$\sum_{b,i} \hat A_{bi}^\dagger \hat A_{bi} = \hat 1 , \qquad (4.20)$$

there exists a measurement interaction whose statistics are described by the POVM in Eq. (4.17)
and gives rise to the conditional and unconditional state evolutions of Eqs. (4.18) and (4.19).

(To justify the term "effects and operations" we should note that Kraus calls a POVM $\{\hat E_b\}$ an
effect and the state transformations described by the operators $\{\hat A_{bi}\}$ operations: the one given by
Eq. (4.18) a selective operation and the one given by Eq. (4.19) a nonselective operation. In these
notes we shall not make use of the term effect, but, lacking a more common term, will call the
operator set $\{\hat A_{bi}\}$ an operation.)

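The way this theorem comes about is seen easily enough from Eq. (4.9). Let the eigenvalues of
$\hat\rho_a$ be denoted by $\lambda_\alpha$, and suppose it has an associated eigenbasis $|a_\alpha\rangle$. Then $\hat\rho_s \otimes \hat\rho_a$ can be written
as

$$\hat\rho_s \otimes \hat\rho_a = \sum_\alpha \sqrt{\lambda_\alpha}\,|a_\alpha\rangle\,\hat\rho_s\,\langle a_\alpha|\,\sqrt{\lambda_\alpha} , \qquad (4.21)$$

and, just expanding Eq. (4.9), we have

$$\begin{aligned}
\hat\rho'_{s|b} &= \frac{1}{p(b)} \sum_\beta \langle a_\beta|(\hat 1 \otimes \hat\Pi_b)\,\hat U^\dagger(\hat\rho_s \otimes \hat\rho_a)\hat U\,(\hat 1 \otimes \hat\Pi_b)|a_\beta\rangle \\
&= \frac{1}{p(b)} \sum_{\alpha\beta} \Big(\sqrt{\lambda_\alpha}\,\langle a_\beta|(\hat 1 \otimes \hat\Pi_b)\hat U^\dagger|a_\alpha\rangle\Big)\,\hat\rho_s\,\Big(\langle a_\alpha|\hat U(\hat 1 \otimes \hat\Pi_b)|a_\beta\rangle\,\sqrt{\lambda_\alpha}\Big) .
\end{aligned} \qquad (4.22)$$

Equation (4.18) comes about by taking

$$\hat A_{b\alpha\beta}^\dagger = \sqrt{\lambda_\alpha}\,\langle a_\alpha|\hat U(\hat 1 \otimes \hat\Pi_b)|a_\beta\rangle \qquad (4.23)$$

and lumping $\alpha$ and $\beta$ into the single index $i$. Filling in the remainder of the theorem is relatively
easy once this is realized.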
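The construction just described can be traced through numerically. The sketch below builds the operators $\hat A_{b\alpha\beta}$ from a random interaction and checks Eqs. (4.17), (4.19) and (4.20) against the explicit ancilla expressions. The dimensions, random inputs and helper names are assumptions made for illustration, and the operator ordering follows the reading of Eqs. (4.6) and (4.22) adopted here, namely $\hat A_{b\alpha\beta} = \sqrt{\lambda_\alpha}\,\langle a_\beta|(\hat 1\otimes\hat\Pi_b)\hat U^\dagger|a_\alpha\rangle$.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed for reproducibility
ds, da = 2, 2                    # illustrative system and ancilla dimensions

def random_density(d):
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def random_unitary(d):
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, _ = np.linalg.qr(g)
    return q

def trace_out_ancilla(x):
    """Partial trace over the ancilla factor."""
    return np.einsum('iaja->ij', x.reshape(ds, da, ds, da))

def ancilla_element(x, bra, ket):
    """<bra| x |ket> taken on the ancilla factor; returns a system operator."""
    return np.einsum('a,iajc,c->ij', bra.conj(), x.reshape(ds, da, ds, da), ket)

rho_s, rho_a = random_density(ds), random_density(da)
U = random_unitary(ds * da)
Ud = U.conj().T
one_s = np.eye(ds)
projectors = [np.kron(one_s, np.outer(e, e)) for e in np.eye(da)]

# Eigenvalues lambda_alpha and eigenvectors |a_alpha> (columns) of rho_a.
lam, vecs = np.linalg.eigh(rho_a)
lam = np.clip(lam, 0.0, None)

# Kraus operators A_{b,alpha,beta}, grouped by outcome b.
kraus = [[np.sqrt(lam[al]) * ancilla_element(P @ Ud, vecs[:, be], vecs[:, al])
          for al in range(da) for be in range(da)]
         for P in projectors]

# Eq. (4.17) versus Eq. (4.12).
povm_from_kraus = [sum(A.conj().T @ A for A in ops) for ops in kraus]
povm_from_ancilla = [trace_out_ancilla(np.kron(one_s, rho_a) @ U @ P @ Ud)
                     for P in projectors]

# Eq. (4.19) versus Eq. (4.10).
rho_out_kraus = sum(A @ rho_s @ A.conj().T for ops in kraus for A in ops)
rho_out_ancilla = trace_out_ancilla(Ud @ np.kron(rho_s, rho_a) @ U)
```

The grouping of the operators by outcome `b` mirrors the index structure of the theorem: summing within a group gives the selective operation, summing over all groups the nonselective one.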

Kraus's theorem is the essential new input required to define the inference-disturbance tradeoff
in such a way that it may have a tractable solution. For now, in our recipe Eq. (4.3), we may
replace the vague symbol $\mathcal{M}$ (standing for a measurement) by a set of operators $\{\hat E_b\}$ and we
may replace the symbol $\mathcal{I}$ (standing for a measurement interaction) by a set of operators $\{\hat A_{bi}\}$.
Moreover, we know how to connect these two sets, namely through Eq. (4.17). This reduces our
problem to choosing a figure of merit FOM for the tradeoff and calculating the optimal quantity

$$\mathop{\mathrm{optimum}}_{\{\hat A_{bi}\}}\ \mathrm{FOM}\Big[\mathrm{Inf}\big(\hat\rho_0,\hat\rho_1;\{\hat A_{bi}\}\big),\ \mathrm{Dist}\big(\hat\rho_0,\hat\rho_1;\{\hat A_{bi}\}\big)\Big] , \qquad (4.24)$$

where, again, "optimum" means either minimum or maximum depending upon the precise definition
of FOM.

4.2.3 Tradeoff Relations 

We finally come to the point where we may attempt to build up various inference-disturbance 
relations. To this end we shall satisfy ourselves with the preliminary work of writing down a fairly
arbitrary relation. We do this mainly because it is somewhat easier to formulate than other more
meaningful relations, but also because it appears to be useful for testing out certain simple cases. 

Before going on to details, however, perhaps we should say a little more about the significance of 
these ideas. It is often said that it is the Heisenberg uncertainty relations that dictate that quantum 
mechanical measurements necessarily disturb the measured system. That, though, is really not 
the case. The Heisenberg relations concern the inability to get hold of two classical observables 
simultaneously, and thus the inability to ascribe classical states of motion to quantum mechanical 
systems. This is a concern that has very little to do with the ultimate limits on what can happen to 
the quantum states themselves when information is gathered about their identity. The foundation 
of this approach differs from that of the standard Heisenberg relations in that it makes no reference 
to conjugate or complementary variables; the only elements entering these considerations are related 
to the quantum states themselves. In this way one can get at a notion of state disturbance that is 
purely quantum mechanical, making no reference to classical considerations. 

What does it really mean to say that the states are disturbed in and of themselves without 
reference to variables such as might appear in the Heisenberg relations? It means quite literally 
that Alice faces a loss of predictability about the outcomes of Bob's measurements whenever an 
information gathering eavesdropper intervenes. Take as an example the case where $\hat\rho_0$ and $\hat\rho_1$ are
nonorthogonal pure states. Then for each of these there exists at least one observable for which
Alice can predict the outcome with complete certainty, namely the projectors parallel to $\hat\rho_0$ and
$\hat\rho_1$, respectively. However, after Alice's quantum states pass into the "black box" occupied by Eve,
neither Alice nor Bob will any longer be able to predict with complete certainty the outcomes of 
both those measurements. This is the real content of these ideas. 



A First "Trial" Relation 

The tradeoff relation to be described in this subsection is literally based on a simple inference 
problem — that of performing a single quantum measurement on one of two unknown quantum 
states and then using the outcome of that measurement to guess the identity of the state. The 
criterion of a good inference is that its expected probability of success be as high as it can possibly 
be. The criterion of a small disturbance is that the expected fidelity between the initial and final 
quantum states be as large as it can possibly be. There are, of course, many other quantities that 
we might have used to gauge inference power, e.g. mutual information, just as there are many 
other quantities that we might have used to gauge the disturbance. We fix our attention on the 
ones described here to get the ball rolling. Let us set up this problem in detail. 



Going back to the basic model introduced in Section 4.2.1, for this scheme, any measurement
Eve performs can always be viewed as the measurement of a two-outcome POVM $\{\hat E_0,\hat E_1\}$. If the
outcome corresponds to $\hat E_0$, she guesses the true state to be $\hat\rho_0$; if it corresponds to $\hat E_1$, she guesses
the state to be $\hat\rho_1$.

So, first consider a fixed POVM $\{\hat E_0,\hat E_1\}$ and a fixed operation $\{\hat A_{bi}\}$ ($b = 0, 1$) consistent with
it in the sense of Eq. (4.17). This measurement gives rise to an expected probability of success
quantified by

$$P_s = \frac{1}{2}\,\mathrm{tr}(\hat\rho_0 \hat E_0) + \frac{1}{2}\,\mathrm{tr}(\hat\rho_1 \hat E_1) . \qquad (4.25)$$

That is to say, the expected probability of success for this measurement is the probability that $\hat\rho_0$
is the true state times the conditional probability that the decision will be right when this is the
case plus a similar term for $\hat\rho_1$. (Here we have assumed the prior probabilities for the two states
precisely equal.) Using Eq. (4.17) and the cyclic property of the trace, the success probability can
be reexpressed as

$$P_s = \frac{1}{2} \sum_{s=0}^{1} \sum_i \mathrm{tr}\big(\hat A_{si}\,\hat\rho_s\,\hat A_{si}^\dagger\big) . \qquad (4.26)$$

We shall identify this quantity as $\mathrm{Inf}(\{\hat A_{bi}\})$, the measure of the inference power given by this
measurement.

Now consider the quantum state Bob gets as the system passes out of the black box. If the
original state was $\hat\rho_0$, he obtains

$$\hat\rho_0' = \sum_{b,i} \hat A_{bi}\,\hat\rho_0\,\hat A_{bi}^\dagger . \qquad (4.27)$$

If the original state was $\hat\rho_1$, he obtains

$$\hat\rho_1' = \sum_{b,i} \hat A_{bi}\,\hat\rho_1\,\hat A_{bi}^\dagger . \qquad (4.28)$$

(Eq. (4.19) is made use of rather than Eq. (4.18) because it is assumed that Bob has no knowledge
of Eve's measurement outcome.) The overall state disturbance by this interaction can be quantified
in terms of any of a number of distinguishability measures for quantum states explored in Chapter
3. Here we choose to make use of Uhlmann's "transition probability" or "fidelity", Eq. (4.16), to
define a measure of clonability,

$$C = \frac{1}{2}\,F(\hat\rho_0,\hat\rho_0') + \frac{1}{2}\,F(\hat\rho_1,\hat\rho_1') . \qquad (4.29)$$

This quantity measures the extent to which the output quantum states "clone" the input states. In
particular, the quantity in Eq. (4.29), bounded between zero and one, attains a maximum value
only when both states are left completely undisturbed by the measurement interaction.
Reexpressing the clonability explicitly in terms of the operation we get

$$C = \frac{1}{2} \sum_{s=0}^{1} \mathrm{tr}\sqrt{\hat\rho_s^{1/2}\Big(\sum_{b,i} \hat A_{bi}\,\hat\rho_s\,\hat A_{bi}^\dagger\Big)\hat\rho_s^{1/2}} . \qquad (4.30)$$

(Here we have used the fact that $F(\hat\rho_0,\hat\rho_1)$ is symmetric in its arguments.) We shall identify this
quantity as $\mathrm{Dist}(\{\hat A_{bi}\})$, the measure of (non)disturbance given by this measurement.
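To make the two quantities concrete, the following sketch evaluates the success probability of Eq. (4.26) and the clonability of Eqs. (4.29) and (4.30) for two simple operations on a pair of pure qubit states. The chosen angle, the state parametrization and the two operations (doing nothing, and measuring then passing on the collapsed state) are assumptions made purely for illustration.

```python
import numpy as np

def psd_sqrt(a):
    """Positive square root of a positive semi-definite Hermitian matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T

def fidelity(rho, sigma):
    """tr sqrt(rho^(1/2) sigma rho^(1/2)), cf. Eqs. (4.16) and (4.30)."""
    s = psd_sqrt(rho)
    return np.trace(psd_sqrt(s @ sigma @ s)).real

def success_probability(rho0, rho1, ops):
    """Eq. (4.26); ops[b] lists the operators A_bi attached to outcome b."""
    p0 = sum(np.trace(A @ rho0 @ A.conj().T).real for A in ops[0])
    p1 = sum(np.trace(A @ rho1 @ A.conj().T).real for A in ops[1])
    return 0.5 * (p0 + p1)

def clonability(rho0, rho1, ops):
    """Eqs. (4.29)-(4.30) with Bob's states given by Eqs. (4.27)-(4.28)."""
    all_ops = [A for outcome in ops for A in outcome]
    out0 = sum(A @ rho0 @ A.conj().T for A in all_ops)
    out1 = sum(A @ rho1 @ A.conj().T for A in all_ops)
    return 0.5 * (fidelity(rho0, out0) + fidelity(rho1, out1))

theta = np.pi / 3                 # assumed angle between the two pure states
xi = np.pi / 4 - theta / 2        # each state sits an angle xi from a basis vector
psi0 = np.array([np.cos(xi), np.sin(xi)])
psi1 = np.array([np.sin(xi), np.cos(xi)])
rho0, rho1 = np.outer(psi0, psi0), np.outer(psi1, psi1)

identity_op = [[np.eye(2)], [np.zeros((2, 2))]]          # no interaction at all
projective_op = [[np.diag([1.0, 0.0])], [np.diag([0.0, 1.0])]]  # measure, pass on

Ps_id, C_id = success_probability(rho0, rho1, identity_op), clonability(rho0, rho1, identity_op)
Ps_pr, C_pr = success_probability(rho0, rho1, projective_op), clonability(rho0, rho1, projective_op)
```

In this sketch the identity operation leaves the clonability at its maximum while giving Eve no better than a coin flip, whereas the projective operation raises the success probability at the cost of a clonability below one.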

Now all that is left is to put the inference and disturbance into a common figure of merit by 
which to compare the two. There are a couple of obvious ways to do this. Since the idea is that, 
as the probability of success in the inference increases, the clonability decreases, and vice versa, 
we know that there should be nontrivial upper limits to both the sum and the products of $P_s$
and $C$. Indeed the same must be true for an infinite number of monotonic functions of $P_s$ and $C$.
Here we shall be happy to focus on the sum as an appropriate figure of merit. Why? For no real
reason other than that this combination looks relatively simple and is enough to demonstrate the
principles involved. So, following through, we simply write down the tradeoff relation

$$P_s + C \le \max_{\{\hat A_{bi}\}} \frac{1}{2} \sum_{s=0}^{1} \left[\, \sum_i \mathrm{tr}\big(\hat A_{si}\,\hat\rho_s\,\hat A_{si}^\dagger\big) + \mathrm{tr}\sqrt{\hat\rho_s^{1/2}\Big(\sum_{b,i} \hat A_{bi}\,\hat\rho_s\,\hat A_{bi}^\dagger\Big)\hat\rho_s^{1/2}}\, \right] . \qquad (4.31)$$

Finally we are in possession of a well-defined mathematical problem awaiting a solution. If the
right-hand side of Eq. (4.31) can be given an explicit expression, then the problem will have been
solved. Techniques for getting at this, however, must be a subject for future research. Presently
we have no general solutions.

We should point out that, though Eq. (4.31) is the tightest bound of the form

$$P_s + C \le f(\hat\rho_0,\hat\rho_1) , \qquad (4.32)$$

other, looser, bounds may also be of some utility, for instance, simply because they may be easier
to derive. This comes about because the right-hand side of Eq. (4.31) is the actual maximum
value of $P_s + C$; often it is easier to bound a quantity than explicitly to maximize it. The only
requirement in the game is that the bound be nontrivial, i.e., smaller than the one that comes
about by maximizing both $P_s$ and $C$ simultaneously,

$$\begin{aligned}
f(\hat\rho_0,\hat\rho_1) &< \max_{\{\hat A_{bi}\}} C + \max_{\{\hat E_b\}} P_s \\
&= 1 + \max_{\{\hat E_b\}} P_s \\
&= 1 + \frac{1}{2}\Big[\,1 + {\sum_j}' \lambda_j(\hat\Gamma)\Big] ,
\end{aligned} \qquad (4.33)$$

where $\lambda_j(\hat\Gamma)$ denotes the eigenvalues of the operator

$$\hat\Gamma = \hat\rho_1 - \hat\rho_0 \qquad (4.34)$$



and the prime on the summation sign signifies that the sum is taken only over the positive eigen- 
values. (See Section 3^ and Refs. |7^, ^.) The right-hand side of Eq. ( 4.33 ) would be the actual 
solution of Eq. (4.31) if and only if an inference measurement entailed no disturbance. In Eq. (4.33),

$$\max_{\{\hat A_{bi}\}} C = 1 \qquad (4.35)$$

follows from the fact that there is a zero-disturbance measurement, namely the identity operation.
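The trivial bound in Eq. (4.33) is easy to evaluate numerically from the positive eigenvalues of $\hat\Gamma$. The sketch below does this for two pure qubit states; the angle and the state parametrization are illustrative assumptions. For pure states separated by an angle $\theta$ the eigenvalues of $\hat\Gamma$ are $\pm\sin\theta$, so the maximal success probability reduces to $\frac{1}{2}(1+\sin\theta)$, which the code uses as a cross-check.

```python
import numpy as np

def max_success_probability(rho0, rho1):
    """(1/2)[1 + sum of the positive eigenvalues of Gamma = rho1 - rho0]."""
    eigenvalues = np.linalg.eigvalsh(rho1 - rho0)
    return 0.5 * (1.0 + eigenvalues[eigenvalues > 0].sum())

theta = np.pi / 3                                  # assumed angle between the states
psi0 = np.array([np.cos(theta / 2), -np.sin(theta / 2)])
psi1 = np.array([np.cos(theta / 2), np.sin(theta / 2)])
rho0, rho1 = np.outer(psi0, psi0), np.outer(psi1, psi1)

p_max = max_success_probability(rho0, rho1)
trivial_bound = 1.0 + p_max                        # right-hand side of Eq. (4.33)
```

Any useful tradeoff bound $f(\hat\rho_0,\hat\rho_1)$ has to come out strictly below `trivial_bound` for nonorthogonal states.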



An Example 

Let us work out a restricted example of this tradeoff relation, just to hint at the interesting insights
it can give for concrete problems.* In this example, the two initial states of interest are pure states

$$\hat\rho_0 = |\psi_0\rangle\langle\psi_0| \quad \text{and} \quad \hat\rho_1 = |\psi_1\rangle\langle\psi_1| \qquad (4.36)$$

separated in Hilbert space by an angle $\theta$. (See inset to Figure 4.1.) Eve bases her inference on the
outcome of the POVM

$$\{\hat\Pi_0 = |0\rangle\langle 0|,\ \hat\Pi_1 = |1\rangle\langle 1|\} , \qquad (4.37)$$

the projection operators onto the basis vectors symmetrically straddling the states. This measurement
leads to the maximal probability of success for this inference problem |]7^ , 27]. The
measurement interaction giving rise to this POVM will be assumed to be of the restricted class
described by the operation

$$\hat A_b = \hat U_b \hat\Pi_b , \qquad b = 0, 1 , \qquad (4.38)$$

where the $\hat U_b$ are arbitrary unitary operators. Note that $\hat A_b^\dagger \hat A_b = \hat\Pi_b$ for these operators, as must
be the case by the definition of an operation.

* A slight variation of this example is done in much greater detail and generality by Fuchs and Peres in Ref.

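With this operation, the two states evolve to post-measurement states according to

$$\begin{aligned}
\hat\rho_0' &= \sum_b \hat U_b \hat\Pi_b\,\hat\rho_0\,\hat\Pi_b \hat U_b^\dagger \\
&= |\langle 0|\psi_0\rangle|^2\,\big(\hat U_0|0\rangle\langle 0|\hat U_0^\dagger\big) + |\langle 1|\psi_0\rangle|^2\,\big(\hat U_1|1\rangle\langle 1|\hat U_1^\dagger\big) \\
&= \cos^2\xi\,\big(\hat U_0|0\rangle\langle 0|\hat U_0^\dagger\big) + \cos^2(\theta+\xi)\,\big(\hat U_1|1\rangle\langle 1|\hat U_1^\dagger\big)
\end{aligned} \qquad (4.39)$$

and

$$\hat\rho_1' = \cos^2(\xi+\theta)\,\big(\hat U_0|0\rangle\langle 0|\hat U_0^\dagger\big) + \cos^2\xi\,\big(\hat U_1|1\rangle\langle 1|\hat U_1^\dagger\big) , \qquad (4.40)$$

where $\xi$ is the angle between $|b\rangle$ and $|\psi_b\rangle$, $b = 0, 1$. This evolution has a simple interpretation: if
Eve finds outcome $b = 0$, she sends on the quantum system in state

$$|\phi_0\rangle \equiv \hat U_0|0\rangle ; \qquad (4.41)$$

if Eve finds outcome $b = 1$, she sends on the state

$$|\phi_1\rangle . \qquad (4.42)$$

The (mixed) states, according to Bob's description, appearing in the outside world are then those
given by Eqs. (4.39) and (4.40).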

To calculate the disturbance given by these operations, we note a simplification to expressions
for $F(\hat\rho_0,\hat\rho_0')$ and $F(\hat\rho_1,\hat\rho_1')$ due to the fact that the initial states are pure. Namely,

$$F(\hat\rho_b,\hat\rho_b') = \mathrm{tr}\sqrt{\hat\rho_b\,\hat\rho_b'\,\hat\rho_b} = \sqrt{\langle\psi_b|\hat\rho_b'|\psi_b\rangle} . \qquad (4.43)$$

If we further restrict $\hat U_0$ and $\hat U_1$ to be such that $\{|\phi_0\rangle, |\phi_1\rangle\}$ lie in the plane spanned by $\{|0\rangle, |1\rangle\}$
and determine equal angles with these basis vectors (see Figure 4.1), then the clonability under
this interaction works out to be

$$C = \frac{1}{2}\sqrt{\langle\psi_0|\hat\rho_0'|\psi_0\rangle} + \frac{1}{2}\sqrt{\langle\psi_1|\hat\rho_1'|\psi_1\rangle} = \sqrt{\cos^2\xi\,\cos^2\phi + \cos^2(\xi+\theta)\,\cos^2(\theta-\phi)} , \qquad (4.44)$$

where $\phi$ is the angle between $|\psi_b\rangle$ and $|\phi_b\rangle$.

The problem is, of course, to find the optimal tradeoff between inference and disturbance for
this measurement and interaction. Since the measurement POVM is fixed, this boils down to
determining the angle $\phi$ such that the clonability $C$ is maximized. Setting $dC/d\phi$ equal to zero
and solving for $\phi$, we find the least disturbing final states to be specified by the angle $\phi_0$, where

$$\phi_0 = \frac{1}{2}\arctan\left\{\left[\frac{1+\sin\theta}{1-\sin\theta} + \cos 2\theta\right]^{-1} \sin 2\theta\right\} . \qquad (4.45)$$

This angle ranges from 0° at $\theta = 0°$ to its maximum value 6.99° at $\theta = 27.73°$, returning to 0° at
$\theta = 90°$. This means that, under the restrictions imposed here, the best strategy on the part of Eve
for minimizing disturbance is not to send on the quantum states guessed in the inference, but rather 
a set of states with slightly higher overlap than the original states. This result points out that the 
(a priori reasonable) strategy of simply sending on the inferred states will only propagate the error 
in that inference; it is much smarter on Eve's part to attempt to hide that error by decreasing the 
probability that a wrong guess can lead to a detection of itself outside the boundaries of the black 
box. 
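The numbers quoted for the optimal angle can be reproduced with a few lines of standard-library Python. This is only a sketch under stated assumptions: the closed forms for the clonability and for the optimal angle are the readings of Eqs. (4.44) and (4.45) adopted here, with the symmetric geometry $2\xi + \theta = 90°$ and with the resent states lying between the original ones, and the function names are made up for illustration. The formula is evaluated only for $\theta < 90°$, where the denominator $1 - \sin\theta$ is nonzero.

```python
import math

def optimal_angle(theta):
    """Angle phi_0 (radians) that maximizes the clonability, cf. Eq. (4.45)."""
    s = math.sin(theta)
    ratio = math.sin(2 * theta) / ((1 + s) / (1 - s) + math.cos(2 * theta))
    return 0.5 * math.atan(ratio)

def clonability(theta, phi):
    """Clonability for resent states rotated by phi, cf. Eq. (4.44)."""
    xi = math.pi / 4 - theta / 2      # symmetric straddling: 2*xi + theta = 90 degrees
    value = (math.cos(xi) ** 2 * math.cos(phi) ** 2
             + math.cos(theta + xi) ** 2 * math.cos(theta - phi) ** 2)
    return math.sqrt(value)

theta = math.radians(27.73)
phi0 = optimal_angle(theta)
phi0_degrees = math.degrees(phi0)

# Clonability when Eve hides her error versus when she resends the inferred states.
c_optimal = clonability(theta, phi0)
c_resend_inferred = clonability(theta, 0.0)
```

Comparing `c_optimal` with `c_resend_inferred` shows the point made in the text: resending exactly the inferred states (here $\phi = 0$) is slightly more disturbing than resending states rotated toward each other by $\phi_0$.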

4.3 Noncommuting Quantum States Cannot Be Broadcast 

The fledgling field of quantum information theory [163] serves perhaps its most important role
in delimiting wholly new classes of what is and is not physically possible. A particularly elegant 
example of this is the theorem |§, ^ that there are no physical means with which an unknown pure 
quantum state can be reproduced or copied. This situation is often summarized with the phrase, 
"quantum states cannot be cloned." Here, we demonstrate an impossibility theorem that extends 
and generalizes the pure-state no-cloning theorem to mixed quantum states.* This theorem strikes
very close to the heart of the distinction between the classical and quantum theories, because it 
provides a nontrivial physical classification of commuting versus noncommuting states. 

In this Section we ask whether there are any physical means — fixed independently of the identity 
of a quantum state — for broadcasting that quantum state onto two separate quantum systems. By 
broadcasting we mean that the marginal density operator of each of the separate systems is the 
same as the state to be broadcast. 

The pure-state "no-cloning" theorem [||, ^ prohibits broadcasting pure states. This is because 
the only way to broadcast a pure state $|\psi\rangle$ is to put the two systems in the product state $|\psi\rangle \otimes |\psi\rangle$,
i.e., to clone $|\psi\rangle$. Things are more complicated when the states are mixed. A mixed-state no-cloning
theorem is not sufficient to demonstrate no-broadcasting, for there are many conceivable ways to
broadcast a mixed state $\rho$ without the joint state being in the product form $\rho \otimes \rho$, the mixed-state
analog of cloning; the systems might be correlated or entangled in such a way as to give the right
marginal density operators. For instance, if the density operator has the spectral decomposition

$$\rho = \sum_b \lambda_b\,|b\rangle\langle b| , \qquad (4.46)$$

a potential broadcasting state is the highly correlated joint state

$$\tilde\rho = \sum_b \lambda_b\,|b\rangle|b\rangle\langle b|\langle b| , \qquad (4.47)$$

which, though not of the product form $\rho \otimes \rho$, reproduces the correct marginal density operators.
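The claim about the correlated joint state of Eq. (4.47) can be verified directly. In the sketch below the eigenvalues and the choice of the computational basis as eigenbasis are illustrative assumptions; the code builds the joint state, takes both partial traces, and compares with the product form.

```python
import numpy as np

lam = np.array([0.5, 0.3, 0.2])      # assumed eigenvalues of rho
N = len(lam)
basis = np.eye(N)                    # eigenvectors |b> taken as the standard basis

# Eq. (4.46): rho = sum_b lambda_b |b><b|
rho = sum(l * np.outer(e, e) for l, e in zip(lam, basis))

# Eq. (4.47): joint state sum_b lambda_b |b>|b><b|<b|
rho_joint = sum(l * np.outer(np.kron(e, e), np.kron(e, e))
                for l, e in zip(lam, basis))

t = rho_joint.reshape(N, N, N, N)
marginal_first = np.einsum('ibjb->ij', t)    # trace out the second system
marginal_second = np.einsum('aiaj->ij', t)   # trace out the first system
product_form = np.kron(rho, rho)
```

Both marginals equal `rho`, while `rho_joint` differs from `product_form`, so the state broadcasts without cloning.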

* This Section represents a collaboration with Howard Barnum, Carlton M. Caves, Richard Jozsa, and Benjamin
Schumacher. The presentation here is based largely on a manuscript submitted to Physical Review Letters; as such,
it contains a small redundancy with the previous Chapters. Also note that this Section breaks from the notation of
the rest of the dissertation in that operators are not distinguished by hats; for instance we now write $\rho$ instead of $\hat\rho$.

The general problem, posed formally, is this. A quantum system AB is composed of two parts,
A and B, each having an $N$-dimensional Hilbert space. System A is secretly prepared in one state
from a set $\mathcal{A} = \{\rho_0, \rho_1\}$ of two quantum states. System B, slated to receive the unknown state, is
in a standard quantum state $\Sigma$. The initial state of the composite system AB is the product state
$\rho_s \otimes \Sigma$, where $s = 0$ or 1 specifies which state is to be broadcast. We ask whether there is any
physical process $\mathcal{E}$, consistent with the laws of quantum theory, that leads to an evolution of the
form

$$\rho_s \otimes \Sigma \rightarrow \mathcal{E}(\rho_s \otimes \Sigma) = \tilde\rho_s , \qquad (4.48)$$

where $\tilde\rho_s$ is any state on the $N^2$-dimensional Hilbert space AB such that

$$\mathrm{tr}_A(\tilde\rho_s) = \rho_s \quad \text{and} \quad \mathrm{tr}_B(\tilde\rho_s) = \rho_s . \qquad (4.49)$$

Here $\mathrm{tr}_A$ and $\mathrm{tr}_B$ denote partial traces over A and B. If there is an $\mathcal{E}$ that satisfies Eq. (4.49) for
both $\rho_0$ and $\rho_1$, then the set $\mathcal{A}$ can be broadcast. A special case of broadcasting is the evolution
specified by

$$\mathcal{E}(\rho_s \otimes \Sigma) = \rho_s \otimes \rho_s . \qquad (4.50)$$

We reserve the word cloning for this strong form of broadcasting.

The most general action $\mathcal{E}$ on AB consistent with quantum theory is to allow AB to interact
unitarily with an auxiliary quantum system C in some standard state and thereafter to ignore the 
auxiliary system |Q; that is, 

$$\mathcal{E}(\rho_s \otimes \Sigma) = \mathrm{tr}_C\big(U(\rho_s \otimes \Sigma \otimes \Upsilon)U^\dagger\big) , \qquad (4.51)$$

for some auxiliary system C, some standard state $\Upsilon$ on C, and some unitary operator $U$ on ABC.
We show that such an evolution can lead to broadcasting if and only if $\rho_0$ and $\rho_1$ commute. (In
this way the concept of broadcasting makes a communication theoretic cut between commuting
and noncommuting density operators, and thus between classical and quantum state descriptions.)
We further show that $\mathcal{A}$ is clonable if and only if $\rho_0$ and $\rho_1$ are identical or orthogonal, i.e.,

$$\rho_0\rho_1 = 0 . \qquad (4.52)$$

To see that the set $\mathcal{A}$ can be broadcast when the states commute, we do not have to go to
the extra trouble of attaching an auxiliary system. Since orthogonal pure states can be cloned,
broadcasting can be obtained by cloning the simultaneous eigenstates of $\rho_0$ and $\rho_1$. Let $|b\rangle$, $b =
1, \ldots, N$, be an orthonormal basis for A in which both $\rho_0$ and $\rho_1$ are diagonal, and let their spectral
decompositions be

$$\rho_s = \sum_b \lambda_{sb}\,|b\rangle\langle b| . \qquad (4.53)$$

Consider any unitary operator $U$ on AB consistent with

$$U|b\rangle|1\rangle = |b\rangle|b\rangle . \qquad (4.54)$$

If we choose $\Sigma = |1\rangle\langle 1|$ and let

$$\tilde\rho_s = U(\rho_s \otimes \Sigma)U^\dagger = \sum_b \lambda_{sb}\,|b\rangle|b\rangle\langle b|\langle b| , \qquad (4.55)$$

we immediately have that $\tilde\rho_0$ and $\tilde\rho_1$ satisfy Eq. (4.49).
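One concrete unitary of the kind required by Eq. (4.54) is a modular-addition ("generalized CNOT") map, which the following sketch uses to broadcast two commuting states. The choice of this particular unitary, the zero-based labels (so that the text's standard state $|1\rangle$ becomes index 0), and the example spectra are all assumptions made for illustration; the text only requires some unitary consistent with Eq. (4.54).

```python
import numpy as np

N = 3

# Permutation unitary U|b>|c> = |b>|(b + c) mod N>, so that U|b>|0> = |b>|b>.
U = np.zeros((N * N, N * N))
for b in range(N):
    for c in range(N):
        U[b * N + (b + c) % N, b * N + c] = 1.0

# Standard state Sigma on system B (index 0 plays the role of the text's |1>).
sigma = np.zeros((N, N))
sigma[0, 0] = 1.0

def broadcast(rho):
    """Eq. (4.55): joint state U (rho (x) Sigma) U^dagger."""
    return U @ np.kron(rho, sigma) @ U.T

def marginals(joint):
    """Reduced states of the two subsystems of a joint state on A (x) B."""
    t = joint.reshape(N, N, N, N)
    return np.einsum('ibjb->ij', t), np.einsum('aiaj->ij', t)

# Two commuting states, diagonal in the same basis (illustrative spectra).
rho0 = np.diag([0.7, 0.2, 0.1])
rho1 = np.diag([0.1, 0.1, 0.8])

marg0 = marginals(broadcast(rho0))
marg1 = marginals(broadcast(rho1))
```

The same fixed `U` and `sigma` serve both states, which is the point of the definition: the broadcasting process is chosen independently of which state was prepared.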



The converse of this statement, that if $\mathcal{A}$ can be broadcast, $\rho_0$ and $\rho_1$ commute, is more
difficult to prove. Our proof is couched in terms of the concept of fidelity between two density
operators. The fidelity $F(\rho_0,\rho_1)$ is defined by

$$F(\rho_0,\rho_1) = \mathrm{tr}\sqrt{\rho_0^{1/2}\,\rho_1\,\rho_0^{1/2}} , \qquad (4.56)$$

where for any positive operator $O$, i.e., any Hermitian operator with nonnegative eigenvalues, $O^{1/2}$
denotes its unique positive square root. (Note that Ref. Q defines fidelity to be the square of the
present quantity.) Fidelity is an analogue of the modulus of the inner product for pure states ||^, ^
and can be interpreted as a measure of distinguishability for quantum states: it ranges between
0 and 1, reaching 0 if and only if the states are orthogonal and reaching 1 if and only if $\rho_0 = \rho_1$. It
is invariant under the interchange $0 \leftrightarrow 1$ and under the transformation

$$\rho_0 \rightarrow U\rho_0U^\dagger \quad \text{and} \quad \rho_1 \rightarrow U\rho_1U^\dagger \qquad (4.57)$$

for any unitary operator $U$ [149]. Also, from the properties of the direct product, one has that

$$F(\rho_0 \otimes \sigma_0, \rho_1 \otimes \sigma_1) = F(\rho_0,\rho_1)\,F(\sigma_0,\sigma_1) . \qquad (4.58)$$



Another reason $F(\rho_0,\rho_1)$ defines a good notion of distinguishability [^] is that it equals the
minimal overlap between the probability distributions $p_0(b) = \mathrm{tr}(\rho_0E_b)$ and $p_1(b) = \mathrm{tr}(\rho_1E_b)$
generated by a generalized measurement or positive operator-valued measure (POVM) $\{E_b\}$ p6| . That
is,

$$F(\rho_0,\rho_1) = \min_{\{E_b\}} \sum_b \sqrt{\mathrm{tr}(\rho_0E_b)}\,\sqrt{\mathrm{tr}(\rho_1E_b)} , \qquad (4.59)$$

where the minimum is taken over all sets of positive operators $\{E_b\}$ such that

$$\sum_b E_b = \mathbf{1} . \qquad (4.60)$$

This representation of fidelity has the advantage of being defined operationally in terms of
measurements. We call a POVM that achieves the minimum in Eq. (4.59) an optimal POVM.

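One way to see the equivalence of Eqs. (4.59) and (4.56) is through the Schwarz inequality for
the operator inner product $\mathrm{tr}(AB^\dagger)$:

$$\mathrm{tr}(AA^\dagger)\,\mathrm{tr}(BB^\dagger) \ge \big|\mathrm{tr}(AB^\dagger)\big|^2 , \qquad (4.61)$$

with equality if and only if

$$A = \alpha B \qquad (4.62)$$

for some constant $\alpha$. Going through this exercise is useful because it leads directly to the proof of
the no-broadcasting theorem. Let $\{E_b\}$ be any POVM and let $U$ be any unitary operator. Using
the cyclic property of the trace and the Schwarz inequality, we have that

$$\begin{aligned}
\sum_b \sqrt{\mathrm{tr}(\rho_0E_b)}\,\sqrt{\mathrm{tr}(\rho_1E_b)}
&= \sum_b \sqrt{\mathrm{tr}\big(U\rho_0^{1/2}E_b\,\rho_0^{1/2}U^\dagger\big)}\,\sqrt{\mathrm{tr}\big(\rho_1^{1/2}E_b\,\rho_1^{1/2}\big)} \\
&\ge \sum_b \Big|\mathrm{tr}\big(U\rho_0^{1/2}E_b^{1/2}E_b^{1/2}\rho_1^{1/2}\big)\Big| \qquad (4.63) \\
&\ge \Big|\sum_b \mathrm{tr}\big(U\rho_0^{1/2}E_b\,\rho_1^{1/2}\big)\Big| \\
&= \Big|\mathrm{tr}\big(U\rho_0^{1/2}\rho_1^{1/2}\big)\Big| . \qquad (4.64)
\end{aligned}$$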



We can use the freedom in $U$ to make the inequality as tight as possible. To do this, we recall
that

$$\max_V \bigl|\mathrm{tr}(VO)\bigr| = \mathrm{tr}\sqrt{O^\dagger O} , \tag{4.65}$$

where $O$ is any operator and the maximum is taken over all unitary operators $V$. The maximum
is achieved only by those $V$ such that

$$VO = \sqrt{O^\dagger O}\,e^{i\phi} , \tag{4.66}$$

where $\phi$ is an arbitrary phase. That there exists at least one such $V$ is ensured by the operator
polar decomposition theorem p5[. Therefore, by choosing

$$e^{i\phi}\,U\rho_0^{1/2}\rho_1^{1/2} = \sqrt{\rho_1^{1/2}\rho_0\,\rho_1^{1/2}} , \tag{4.67}$$

we get that

$$\sum_b \sqrt{\mathrm{tr}(\rho_0 E_b)}\,\sqrt{\mathrm{tr}(\rho_1 E_b)} \ge F(\rho_0,\rho_1) . \tag{4.68}$$
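The maximization in Eq. (4.65) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it uses NumPy's singular value decomposition as a stand-in for the polar decomposition, and the names `O`, `V_opt`, and `random_unitary` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
O = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))  # an arbitrary operator

# Singular value decomposition O = W diag(s) Xh, so that
# sqrt(O^dagger O) = Xh^dagger diag(s) Xh and tr sqrt(O^dagger O) = sum(s).
W, s, Xh = np.linalg.svd(O)
trace_norm = s.sum()

# A unitary satisfying Eq. (4.66) with phase phi = 0: V_opt @ O = sqrt(O^dagger O).
V_opt = Xh.conj().T @ W.conj().T
achieved = abs(np.trace(V_opt @ O))

def random_unitary(d, rng):
    """Illustrative random unitary from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q

# No randomly sampled unitary should exceed the bound of Eq. (4.65).
best_random = max(abs(np.trace(random_unitary(d, rng) @ O)) for _ in range(200))
```

Here `achieved` should coincide with `trace_norm`, while `best_random` stays at or below it; random sampling only probes the bound and is not a proof of it.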

To find optimal POVMs, we consult the conditions for equality in Eq. (4.64). These arise from
Step (4.63) and the one following it: a POVM $\{E_b\}$ is optimal if and only if

$$U\rho_0^{1/2}E_b^{1/2} = \mu_b\,\rho_1^{1/2}E_b^{1/2} \tag{4.69}$$

and $U$ is rephased such that

$$\mathrm{tr}\bigl(U\rho_0^{1/2}E_b\,\rho_1^{1/2}\bigr) = \mu_b\,\mathrm{tr}(\rho_1 E_b) \ge 0 \quad\Longleftrightarrow\quad \mu_b \ge 0 . \tag{4.70}$$

When $\rho_1$ is invertible, Eq. (4.69) becomes

$$M E_b^{1/2} = \mu_b E_b^{1/2} , \tag{4.71}$$

where

$$M = \rho_1^{-1/2}U\rho_0^{1/2} = \rho_1^{-1/2}\sqrt{\rho_1^{1/2}\rho_0\,\rho_1^{1/2}}\;\rho_1^{-1/2} \tag{4.72}$$

is a positive operator. Therefore one way to satisfy Eq. (4.69) with $\mu_b \ge 0$ is to take $E_b = |b\rangle\langle b|$,
where the vectors $|b\rangle$ are an orthonormal eigenbasis for $M$, with $\mu_b$ chosen to be the eigenvalue of
$|b\rangle$. When $\rho_1$ is noninvertible, there are still optimal POVMs. One can choose the first $E_b$ to be the
projector onto the null subspace of $\rho_1$; in the support of $\rho_1$, i.e., the orthocomplement of the null
subspace, $\rho_1$ is invertible, so one can construct the analogue of $M$ and proceed as for an invertible
$\rho_1$. Note that if both $\rho_0$ and $\rho_1$ are invertible, $M$ is invertible.
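The construction of an optimal POVM from the eigenbasis of $M$ can be sketched numerically for invertible states. This is a hedged illustration, assuming NumPy; the helpers `psd_power`, `random_density`, and `overlap` are hypothetical names, and the comparison measurement (the standard basis) is an arbitrary choice.

```python
import numpy as np

def psd_power(rho, p):
    """Power of a positive definite Hermitian matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh((rho + rho.conj().T) / 2)
    return (vecs * vals**p) @ vecs.conj().T

def random_density(d, rng):
    """Illustrative random (generically invertible) density matrix."""
    a = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(2)
d = 3
rho0, rho1 = random_density(d, rng), random_density(d, rng)

s1, s1_inv = psd_power(rho1, 0.5), psd_power(rho1, -0.5)
root = psd_power(s1 @ rho0 @ s1, 0.5)
F = np.trace(root).real            # fidelity tr sqrt(rho1^{1/2} rho0 rho1^{1/2})
M = s1_inv @ root @ s1_inv         # Eq. (4.72)

def overlap(basis):
    """Sum_b sqrt(p0(b) p1(b)) for the projective measurement onto the columns of `basis`."""
    p0 = np.einsum('ib,ij,jb->b', basis.conj(), rho0, basis).real
    p1 = np.einsum('ib,ij,jb->b', basis.conj(), rho1, basis).real
    return np.sqrt(np.clip(p0, 0, None) * np.clip(p1, 0, None)).sum()

mu, eigvecs = np.linalg.eigh((M + M.conj().T) / 2)   # eigenvalues mu_b, eigenvectors |b>
optimal_overlap = overlap(eigvecs)                   # E_b = |b><b| as in the text
standard_overlap = overlap(np.eye(d))                # some other measurement
```

The eigenbasis of `M` should reproduce `F`, whereas the standard-basis measurement can only give a value at least as large, in line with Eq. (4.68).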

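We begin the proof of the no-broadcasting theorem by using Eq. (4.59) to show that fidelity
cannot decrease under the operation of partial trace; this gives rise to an elementary constraint on
all potential broadcasting processes $\mathcal{E}$. Suppose Eq. (4.49) is satisfied for the process $\mathcal{E}$ of Eq. (4.51),
with $\tilde\rho_s$ the resulting state of $AB$, and let $\{E_b\}$ denote an optimal POVM for distinguishing $\rho_0$ and
$\rho_1$. Then, for each $s$,

$$\mathrm{tr}\bigl(\tilde\rho_s(E_b\otimes\mathbb{1})\bigr) = \mathrm{tr}_A\bigl(\mathrm{tr}_B(\tilde\rho_s)\,E_b\bigr) = \mathrm{tr}_A(\rho_s E_b) ; \tag{4.73}$$

it follows that

$$\begin{aligned}
F_A(\rho_0,\rho_1) &= \sum_b \sqrt{\mathrm{tr}\bigl(\tilde\rho_0(E_b\otimes\mathbb{1})\bigr)}\,\sqrt{\mathrm{tr}\bigl(\tilde\rho_1(E_b\otimes\mathbb{1})\bigr)} \\
&\ge \min_{\{\tilde E_c\}} \sum_c \sqrt{\mathrm{tr}(\tilde\rho_0\tilde E_c)}\,\sqrt{\mathrm{tr}(\tilde\rho_1\tilde E_c)} \\
&= F(\tilde\rho_0,\tilde\rho_1) . \qquad (4.74)
\end{aligned}$$

Here $F_A(\rho_0,\rho_1)$ denotes simply the fidelity $F(\rho_0,\rho_1)$, but the subscript $A$ emphasizes that $F_A(\rho_0,\rho_1)$
stands for the particular representation on the first line. The inequality in Eq. (4.74) comes from the
fact that $\{E_b\otimes\mathbb{1}\}$ might not be an optimal POVM for distinguishing $\tilde\rho_0$ and $\tilde\rho_1$; this demonstrates
the said partial trace property. Similarly, since

$$\mathrm{tr}\bigl(\tilde\rho_s(\mathbb{1}\otimes E_b)\bigr) = \mathrm{tr}_B\bigl(\mathrm{tr}_A(\tilde\rho_s)\,E_b\bigr) = \mathrm{tr}_B(\rho_s E_b) , \tag{4.75}$$

it follows that

$$\begin{aligned}
F_B(\rho_0,\rho_1) &= \sum_b \sqrt{\mathrm{tr}\bigl(\tilde\rho_0(\mathbb{1}\otimes E_b)\bigr)}\,\sqrt{\mathrm{tr}\bigl(\tilde\rho_1(\mathbb{1}\otimes E_b)\bigr)} \\
&\ge F(\tilde\rho_0,\tilde\rho_1) , \qquad (4.76)
\end{aligned}$$

where the subscript $B$ emphasizes that $F_B(\rho_0,\rho_1)$ stands for the representation on the first line.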
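The partial-trace property can be probed numerically for arbitrary bipartite states. The sketch below is an illustration only, assuming NumPy and a tensor-product ordering with $A$ as the first factor; `partial_trace` and the other helper names are hypothetical.

```python
import numpy as np

def psd_sqrt(rho):
    """Square root of a positive semidefinite Hermitian matrix."""
    vals, vecs = np.linalg.eigh((rho + rho.conj().T) / 2)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.conj().T

def fidelity(rho0, rho1):
    """Fidelity as the sum of singular values of rho0^{1/2} rho1^{1/2}."""
    return np.linalg.svd(psd_sqrt(rho0) @ psd_sqrt(rho1), compute_uv=False).sum()

def random_density(d, rng):
    a = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real

def partial_trace(rho, dA, dB, keep):
    """Reduce a density matrix on A (x) B to the subsystem `keep` ('A' or 'B')."""
    r = rho.reshape(dA, dB, dA, dB)          # indices (a, b, a', b')
    if keep == 'A':
        return np.einsum('ibjb->ij', r)      # trace over B
    return np.einsum('aiaj->ij', r)          # trace over A

rng = np.random.default_rng(3)
dA = dB = 2
big0, big1 = random_density(dA * dB, rng), random_density(dA * dB, rng)

f_joint = fidelity(big0, big1)
f_A = fidelity(partial_trace(big0, dA, dB, 'A'), partial_trace(big1, dA, dB, 'A'))
f_B = fidelity(partial_trace(big0, dA, dB, 'B'), partial_trace(big1, dA, dB, 'B'))

# Sanity check of the helper on a product state: the marginals are the factors.
r, s = random_density(dA, rng), random_density(dB, rng)
marg_A = partial_trace(np.kron(r, s), dA, dB, 'A')
marg_B = partial_trace(np.kron(r, s), dA, dB, 'B')
```

Both `f_A` and `f_B` should be at least `f_joint`: discarding a subsystem cannot make the two states easier to distinguish.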



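On the other hand, we can just as easily derive an inequality that is opposite to Eqs. (4.74) and
(4.76). By the direct product formula and the invariance of fidelity under unitary transformations,

$$\begin{aligned}
F(\rho_0,\rho_1) &= F(\rho_0\otimes\Sigma\otimes\Upsilon,\ \rho_1\otimes\Sigma\otimes\Upsilon) \\
&= F\bigl(U(\rho_0\otimes\Sigma\otimes\Upsilon)U^\dagger,\ U(\rho_1\otimes\Sigma\otimes\Upsilon)U^\dagger\bigr) . \qquad (4.77)
\end{aligned}$$

Therefore, by the partial-trace property,

$$F(\rho_0,\rho_1) \le F\Bigl(\mathrm{tr}_C\bigl(U(\rho_0\otimes\Sigma\otimes\Upsilon)U^\dagger\bigr),\ \mathrm{tr}_C\bigl(U(\rho_1\otimes\Sigma\otimes\Upsilon)U^\dagger\bigr)\Bigr) , \tag{4.78}$$

or, more succinctly,

$$F(\rho_0,\rho_1) \le F\bigl(\mathcal{E}(\rho_0\otimes\Sigma),\ \mathcal{E}(\rho_1\otimes\Sigma)\bigr) = F(\tilde\rho_0,\tilde\rho_1) . \tag{4.79}$$

The elementary constraint now follows: the only way to maintain Eqs. (4.74), (4.76), and (4.79)
is with strict equality. In other words, we have that if the set $\mathcal{A}$ can be broadcast, then there are
density operators $\tilde\rho_0$ and $\tilde\rho_1$ on $AB$ satisfying Eq. (4.49) and

$$F_A(\rho_0,\rho_1) = F(\tilde\rho_0,\tilde\rho_1) = F_B(\rho_0,\rho_1) . \tag{4.80}$$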

Let us pause at this point to consider the restricted question of cloning. If $\mathcal{A}$ is to be clonable,
there must exist a process $\mathcal{E}$ such that $\tilde\rho_s = \rho_s\otimes\rho_s$ for $s = 0, 1$. But then, by Eq. (4.80), we must
have

$$F(\rho_0,\rho_1) = F(\rho_0\otimes\rho_0,\ \rho_1\otimes\rho_1) = \bigl(F(\rho_0,\rho_1)\bigr)^2 , \tag{4.81}$$

which means that $F(\rho_0,\rho_1) = 1$ or $0$, i.e., $\rho_0$ and $\rho_1$ are identical or orthogonal. There can
be no cloning for density operators with nontrivial fidelity. The converse, that orthogonal and
identical density operators can be cloned, follows, in the first case, from the fact that they can be
distinguished by measurement and, in the second case, because they need not be distinguished at
all.
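A small numerical instance makes the obstruction in Eq. (4.81) concrete. The two qubit states below are arbitrary illustrative choices, not taken from the text, and the helper names are hypothetical; the sketch assumes NumPy.

```python
import numpy as np

def psd_sqrt(rho):
    """Square root of a positive semidefinite Hermitian matrix."""
    vals, vecs = np.linalg.eigh((rho + rho.conj().T) / 2)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.conj().T

def fidelity(rho0, rho1):
    """Fidelity as the sum of singular values of rho0^{1/2} rho1^{1/2}."""
    return np.linalg.svd(psd_sqrt(rho0) @ psd_sqrt(rho1), compute_uv=False).sum()

# Two noncommuting, nonorthogonal, nonidentical qubit states (illustrative values).
rho0 = np.array([[0.9, 0.0], [0.0, 0.1]])
rho1 = np.array([[0.5, 0.4], [0.4, 0.5]])

f_single = fidelity(rho0, rho1)
# Hypothetical cloned outputs rho_s (x) rho_s; by Eq. (4.58) their fidelity is f_single**2.
f_clones = fidelity(np.kron(rho0, rho0), np.kron(rho1, rho1))
```

Since `f_single` lies strictly between 0 and 1, `f_clones` is strictly smaller than `f_single`, so the equality demanded by Eq. (4.80) fails for this pair.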

Like the pure-state no-cloning theorem [^, ^], this no-cloning result for mixed states is a
consistency requirement for the axiom that quantum measurements cannot distinguish nonorthogonal
states with perfect reliability. If nonorthogonal quantum states could be cloned, there would
exist a measurement procedure for distinguishing those states with arbitrarily high reliability: one
could make measurements on enough copies of the quantum state to make the probability of a
correct inference of its identity arbitrarily high. That this consistency requirement, as expressed
in Eq. (4.80), should also exclude more general kinds of broadcasting problems is not immediately
obvious. Nevertheless, this is the content of our claim that Eq. (4.80) generally cannot be satisfied;
any broadcasting process can be viewed as creating distinguishability ex nihilo with respect to
measurements on the larger Hilbert space $AB$. Only for the case of commuting density operators
does broadcasting not create any extra distinguishability.

We now show that Eq. (4.80) implies that $\rho_0$ and $\rho_1$ commute. To simplify the exposition,
we assume that $\rho_0$ and $\rho_1$ are invertible. We proceed by studying the conditions necessary for
the representations $F_A(\rho_0,\rho_1)$ and $F_B(\rho_0,\rho_1)$ in Eqs. (4.74) and (4.76) to equal $F(\tilde\rho_0,\tilde\rho_1)$. Recall
that the optimal POVM $\{E_b\}$ for distinguishing $\rho_0$ and $\rho_1$ can be chosen so that the POVM
elements $E_b = |b\rangle\langle b|$ are a complete set of orthogonal one-dimensional projectors onto orthonormal
eigenstates of $M$. Then, repeating the steps leading from Eqs. (4.64) to (4.70), one finds that the
necessary conditions for equality in Eq. (4.80) are that each

$$E_b\otimes\mathbb{1} = (E_b\otimes\mathbb{1})^{1/2} \tag{4.82}$$

and each

$$\mathbb{1}\otimes E_b = (\mathbb{1}\otimes E_b)^{1/2} \tag{4.83}$$

satisfy

$$\tilde U\tilde\rho_0^{1/2}(\mathbb{1}\otimes E_b) = \alpha_b\,\tilde\rho_1^{1/2}(\mathbb{1}\otimes E_b) \tag{4.84}$$

and

$$\tilde V\tilde\rho_0^{1/2}(E_b\otimes\mathbb{1}) = \beta_b\,\tilde\rho_1^{1/2}(E_b\otimes\mathbb{1}) , \tag{4.85}$$

where $\alpha_b$ and $\beta_b$ are nonnegative numbers and $\tilde U$ and $\tilde V$ are unitary operators satisfying

$$\tilde U\tilde\rho_0^{1/2}\tilde\rho_1^{1/2} = \tilde V\tilde\rho_0^{1/2}\tilde\rho_1^{1/2} = \sqrt{\tilde\rho_1^{1/2}\tilde\rho_0\,\tilde\rho_1^{1/2}} . \tag{4.86}$$

Although $\rho_0$ and $\rho_1$ are assumed invertible, one cannot demand that $\tilde\rho_0$ and $\tilde\rho_1$ be invertible; a
glance at Eq. (4.55) shows that to be too restrictive. This means that $\tilde U$ and $\tilde V$ need not be the
same. Also we cannot assume that there is any relation between $\alpha_b$ and $\beta_b$.

The remainder of the proof consists in showing that Eqs. (4.84) through (4.86), which are
necessary (though perhaps not sufficient) for broadcasting, are nevertheless restrictive enough to
imply that $\rho_0$ and $\rho_1$ commute. The first step is to sum over $b$ in Eqs. (4.84) and (4.85). Defining
the positive operators

$$G = \sum_b \alpha_b\,|b\rangle\langle b| \tag{4.87}$$

and

$$H = \sum_b \beta_b\,|b\rangle\langle b| , \tag{4.88}$$

we obtain

$$\tilde U\tilde\rho_0^{1/2} = \tilde\rho_1^{1/2}(\mathbb{1}\otimes G) \tag{4.89}$$

and

$$\tilde V\tilde\rho_0^{1/2} = \tilde\rho_1^{1/2}(H\otimes\mathbb{1}) . \tag{4.90}$$

The next step is to demonstrate that $G$ and $H$ are invertible and, in fact, equal to each other.
Multiplying Eqs. (4.89) and (4.90) from the left by $\tilde\rho_0^{1/2}\tilde U^\dagger$ and $\tilde\rho_0^{1/2}\tilde V^\dagger$, respectively, and
partial tracing the first over $A$ and the second over $B$, we get

$$\rho_0 = \mathrm{tr}_A\bigl(\tilde\rho_0^{1/2}\tilde U^\dagger\tilde\rho_1^{1/2}\bigr)\,G \tag{4.91}$$

and

$$\rho_0 = \mathrm{tr}_B\bigl(\tilde\rho_0^{1/2}\tilde V^\dagger\tilde\rho_1^{1/2}\bigr)\,H . \tag{4.92}$$

Since, by assumption, $\rho_0$ is invertible, it follows that $G$ and $H$ are invertible. Returning to
Eqs. (4.89) and (4.90), multiplying both from the left by $\tilde\rho_1^{1/2}$ and tracing over $A$ and $B$, respectively, we
obtain

$$\mathrm{tr}_A\bigl(\tilde\rho_1^{1/2}\tilde U\tilde\rho_0^{1/2}\bigr) = \rho_1 G \tag{4.93}$$

and

$$\mathrm{tr}_B\bigl(\tilde\rho_1^{1/2}\tilde V\tilde\rho_0^{1/2}\bigr) = \rho_1 H . \tag{4.94}$$

Conjugating Eqs. (4.93) and (4.94) and inserting the results into Eqs. (4.91) and (4.92) yields

$$\rho_0 = G\rho_1 G \quad\text{and}\quad \rho_0 = H\rho_1 H . \tag{4.95}$$

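This shows that $G = H$, because these equations have a unique positive solution, namely the
operator $M$ of Eq. (4.72). This can be seen by multiplying Eq. (4.95) from the left and right by
$\rho_1^{1/2}$ to get

$$\rho_1^{1/2}\rho_0\,\rho_1^{1/2} = \bigl(\rho_1^{1/2}G\rho_1^{1/2}\bigr)^2 . \tag{4.96}$$

The positive operator $\rho_1^{1/2}G\rho_1^{1/2}$ is thus the unique positive square root of $\rho_1^{1/2}\rho_0\,\rho_1^{1/2}$.
Knowing that

$$G = H = M , \tag{4.97}$$

we return to Eqs. (4.89) and (4.90). The two, taken together, imply that

$$\tilde V^\dagger\tilde U\,\tilde\rho_0^{1/2}(M\otimes\mathbb{1}) = \tilde\rho_0^{1/2}(\mathbb{1}\otimes M) . \tag{4.98}$$

If $|b\rangle$ and $|c\rangle$ are eigenvectors of $M$, with eigenvalues $\mu_b$ and $\mu_c$, Eq. (4.98) implies that

$$\tilde V^\dagger\tilde U\bigl(\tilde\rho_0^{1/2}|b\rangle|c\rangle\bigr) = \frac{\mu_c}{\mu_b}\bigl(\tilde\rho_0^{1/2}|b\rangle|c\rangle\bigr) . \tag{4.99}$$

This means that $\tilde\rho_0^{1/2}|b\rangle|c\rangle$ is zero or it is an eigenvector of the unitary operator $\tilde V^\dagger\tilde U$. In the latter
case, since the eigenvalues of a unitary operator have modulus 1, it must be true that $\mu_b = \mu_c$.
Hence we can conclude that

$$\tilde\rho_0^{1/2}|b\rangle|c\rangle = 0 \quad\text{when}\quad \mu_b \ne \mu_c . \tag{4.100}$$

This is enough to show that $M$ and $\rho_0$ commute and hence

$$[\rho_0,\rho_1] = 0 . \tag{4.101}$$

To see this, consider the matrix element

$$\begin{aligned}
\langle b'|(M\rho_0 - \rho_0 M)|b\rangle &= (\mu_{b'} - \mu_b)\,\langle b'|\rho_0|b\rangle \\
&= (\mu_{b'} - \mu_b)\,\langle b'|\mathrm{tr}_A(\tilde\rho_0)|b\rangle \\
&= (\mu_{b'} - \mu_b)\sum_c \langle b'|\langle c|\tilde\rho_0|c\rangle|b\rangle . \qquad (4.102)
\end{aligned}$$

If $\mu_b = \mu_{b'}$, this is automatically zero. If, on the other hand, $\mu_b \ne \mu_{b'}$, then the sum over $c$ must
vanish by Eq. (4.100), because each term contains a factor $\tilde\rho_0^{1/2}|c\rangle|b\rangle$ or $\tilde\rho_0^{1/2}|c\rangle|b'\rangle$ and $\mu_c$ cannot
equal both $\mu_b$ and $\mu_{b'}$. It follows that $\rho_0$ and $M$ commute. Hence, using Eq. (4.95),

$$\begin{aligned}
\rho_1\rho_0 &= M^{-1}\rho_0 M^{-1}\rho_0 \\
&= \rho_0 M^{-1}\rho_0 M^{-1} \\
&= \rho_0\rho_1 . \qquad (4.103)
\end{aligned}$$

This completes the proof that noncommuting quantum states cannot be broadcast.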
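Two algebraic facts used in the proof can be checked on small examples: the operator $M$ of Eq. (4.72) solves $\rho_0 = M\rho_1 M$, and $M$ commutes with $\rho_0$ exactly when $\rho_0$ and $\rho_1$ commute. The sketch below assumes NumPy; the matrices are arbitrary illustrative choices and the helper names are hypothetical.

```python
import numpy as np

def psd_power(rho, p):
    """Power of a positive definite Hermitian matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh((rho + rho.conj().T) / 2)
    return (vecs * vals**p) @ vecs.conj().T

def M_operator(rho0, rho1):
    """M = rho1^{-1/2} sqrt(rho1^{1/2} rho0 rho1^{1/2}) rho1^{-1/2}, as in Eq. (4.72)."""
    s1, s1_inv = psd_power(rho1, 0.5), psd_power(rho1, -0.5)
    return s1_inv @ psd_power(s1 @ rho0 @ s1, 0.5) @ s1_inv

def comm_norm(a, b):
    """Frobenius norm of the commutator [a, b]."""
    return np.linalg.norm(a @ b - b @ a)

# Commuting pair: both states diagonal in the same basis.
rho0_c = np.diag([0.7, 0.2, 0.1])
rho1_c = np.diag([0.3, 0.3, 0.4])
M_c = M_operator(rho0_c, rho1_c)

# Noncommuting pair of invertible qubit states.
rho0_n = np.array([[0.8, 0.0], [0.0, 0.2]])
rho1_n = np.array([[0.5, 0.3], [0.3, 0.5]])
M_n = M_operator(rho0_n, rho1_n)

residual_c = np.linalg.norm(M_c @ rho1_c @ M_c - rho0_c)   # Eq. (4.95) with G = M
residual_n = np.linalg.norm(M_n @ rho1_n @ M_n - rho0_n)
```

Both residuals should vanish up to rounding. For the commuting pair `comm_norm(M_c, rho0_c)` is zero, while for the noncommuting pair `comm_norm(M_n, rho0_n)` is not, which is the situation the proof rules out for broadcastable states.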
Note that, by the same method as above,

$$\tilde\rho_1^{1/2}|b\rangle|c\rangle = 0 \quad\text{when}\quad \mu_b \ne \mu_c . \tag{4.104}$$

This condition, along with Eq. (4.100), determines the conceivable broadcasting states, in which
the correlations between the systems $A$ and $B$ range from purely classical to purely quantum. For
example, since $\rho_0$ and $\rho_1$ commute, the states of Eq. (4.55) satisfy these conditions, but so do the
perfectly entangled pure states

$$|\tilde\psi_s\rangle = \sum_b \sqrt{\lambda_{sb}}\,|b\rangle|b\rangle . \tag{4.105}$$

However, not all such potential broadcasting states can be realized by a physical process $\mathcal{E}$. The
reason for this is quite intuitive: since the states $\rho_0$ and $\rho_1$ commute, the eigenvalues $\lambda_{0b}$ and $\lambda_{1b}$
correspond to two different probability distributions for the eigenvectors. Any device that could
produce the states in Eq. (4.105) would have to essentially read the mind of the person who set the
(subjective) probability assignments; clearly this cannot be done.

Nevertheless, this can be seen in a more formal way with a simple example. Suppose $S(\rho_0) \ne
S(\rho_1)$, where

$$S(\rho) = -\mathrm{tr}(\rho\ln\rho) \tag{4.106}$$

denotes the von Neumann entropy. In order for the states in Eq. (4.105) to come about, the unitary
operator in Eq. (4.51) must be such that

$$U(\rho_s\otimes\Sigma\otimes\Upsilon)U^\dagger = |\tilde\psi_s\rangle\langle\tilde\psi_s|\otimes\Upsilon_s . \tag{4.107}$$

It then follows, by the unitary invariance of the von Neumann entropy and the fact that entropies
add across independent subsystems, that

$$S(\rho_0) - S(\rho_1) = S(\Upsilon_0) - S(\Upsilon_1) . \tag{4.108}$$

However,

$$F(\rho_0,\rho_1) = \bigl|\langle\tilde\psi_0|\tilde\psi_1\rangle\bigr| \tag{4.109}$$

by construction. Therefore, by Eq. (4.107), $F(\Upsilon_0,\Upsilon_1) = 1$. Hence $\Upsilon_0 = \Upsilon_1$ and it follows that
Eq. (4.108) cannot be satisfied.
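The ingredients of this argument can be illustrated for a pair of commuting states. The sketch below assumes NumPy and a common eigenbasis taken to be the standard basis; the eigenvalue lists `lam0` and `lam1` are arbitrary illustrative choices and the function names are hypothetical.

```python
import numpy as np

# Eigenvalues of two commuting density operators in a shared eigenbasis (illustrative).
lam0 = np.array([0.7, 0.2, 0.1])
lam1 = np.array([0.3, 0.3, 0.4])
d = len(lam0)

def entangled_state(lam):
    """|psi_s> = sum_b sqrt(lam_b) |b>|b>, as in Eq. (4.105); basis index is b*d + b."""
    psi = np.zeros(d * d)
    psi[np.arange(d) * (d + 1)] = np.sqrt(lam)
    return psi

def marginals(psi):
    """Reduced states on A and B of a pure state with amplitudes m[a, b]."""
    m = psi.reshape(d, d)
    return m @ m.conj().T, m.T @ m.conj()

def entropy(lam):
    """Von Neumann entropy -tr(rho ln rho) from the eigenvalues of rho, Eq. (4.106)."""
    lam = lam[lam > 0]
    return float(-(lam * np.log(lam)).sum())

psi0, psi1 = entangled_state(lam0), entangled_state(lam1)
rhoA0, rhoB0 = marginals(psi0)
rhoA1, rhoB1 = marginals(psi1)

overlap = abs(psi0 @ psi1)                 # right-hand side of Eq. (4.109)
fidelity = np.sqrt(lam0 * lam1).sum()      # fidelity of the commuting states
S0, S1 = entropy(lam0), entropy(lam1)
```

Both marginals of each $|\tilde\psi_s\rangle$ reproduce $\rho_s$, and `overlap` equals `fidelity`, as Eq. (4.109) states. Because `S0` and `S1` differ here, Eq. (4.108) would require ancilla states of different entropy, which contradicts $\Upsilon_0 = \Upsilon_1$.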
In closing, we mention an application of this result. In some versions of quantum cryptography
[164], the legitimate users of a communication channel encode the bits 0 and 1 into nonorthogonal
pure states. This is done to ensure that any eavesdropping is detectable, since eavesdropping
necessarily disturbs the states sent to the legitimate receiver [165]. If the channel is noisy, however,
causing the bits to evolve to noncommuting mixed states, the detectability of eavesdropping is
no longer a given. The result presented here shows that there are no means available for an
eavesdropper to obtain the signal, noise and all, intended for the legitimate receiver without in
some way changing the states sent to the receiver. Because the dimensionality of the density
operators in the no-broadcasting theorem is completely arbitrary, this conclusion holds for all
possible eavesdropping attacks. This includes those schemes where measurements are made on
whole strings of quantum systems rather than on the individual ones.
Chapter 5

References for Research in Quantum 
Distinguishability and State 
Disturbance 

"Of course, serendipity played its role — 
some of the liveliest specimens . . . were 
found while looking for something else." 

— Nicolas Slonimsky 
Lexicon of Musical Invective 

This Chapter contains 528 references that may be useful in answering the following questions 
in all their varied contexts: "How statistically distinguishable are quantum states?" and "What 
is the best tradeoff between disturbance and inference in quantum measurement?" References are 
grouped under three major headings: Progress Toward the Quantum Problem; Information Theory 
and Classical Distinguishability; and Matrix Inequalities, Operator Relations, and Mathematical 
Techniques. 